Scraping of web page not working


rtadams89

Ultimately, I'm trying to get a list of all the "Individual names" that show up on this page: https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1

 

To get the contents of that page (for later parsing with regex), I first tried file_get_contents(), but it seems to return the contents of an "Object Moved" error page. Looking at the source of the page I am trying to scrape, it appears to use JavaScript to submit a form post before the data is shown; I imagine this is why an error page is returned when I use file_get_contents().
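
"Object Moved" is the body IIS typically sends along with a redirect response, so the response headers may say more than the body does. The following is only a minimal sketch, assuming the headers are available as an array of raw lines in the shape PHP's `$http_response_header` uses. `find_redirect_target()` is a hypothetical helper name, and the sample headers are made up for illustration rather than captured from the site.

```php
<?php
// Hypothetical helper: return the value of the last Location header found
// in an array of raw header lines, or null when there is none.
function find_redirect_target(array $headers)
{
    $target = null;
    foreach ($headers as $line) {
        // Header names are case-insensitive
        if (stripos($line, 'Location:') === 0) {
            $target = trim(substr($line, strlen('Location:')));
        }
    }
    return $target;
}

// Made-up sample in the format of $http_response_header
$sample = array(
    'HTTP/1.1 302 Found',
    'Location: /Error.aspx',
    'Content-Type: text/html',
);

echo find_redirect_target($sample), "\n";
```

Running the same helper on the real `$http_response_header` after a file_get_contents() call would show where the server is trying to send the request.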

 

For my second attempt, I used this function which I found on the PHP.net comments page for file_get_contents().

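function http_post ($url, $data)
{
    // Encode the fields as an application/x-www-form-urlencoded body
    $data_url = http_build_query ($data);
    $data_len = strlen ($data_url);

    $context = stream_context_create (array ('http'=>array ('method'=>'POST'
            , 'header'=>"Connection: close\r\nContent-Type: application/x-www-form-urlencoded\r\nContent-Length: $data_len\r\n"
            , 'content'=>$data_url
            )));

    // file_get_contents() populates $http_response_header in this scope
    $content = file_get_contents ($url, false, $context);

    return array ('content'=>$content, 'headers'=>$http_response_header);
}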

It too returned the same error page.

 

I would appreciate some help in getting the page data of the URL above into a PHP variable so that I can process it further.


I got a bit further using this code:

<?php

        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1");

        // identify as a browser and follow redirects
        $userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)";
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

        // return the transfer as a string instead of printing it
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

        // note: this turns off SSL certificate verification
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

        // $output contains the output string, or false on failure
        $output = curl_exec($ch);

        // close curl resource to free up system resources
        curl_close($ch);

        echo $output;

?>

 

I at least get some of the page, but a major chunk in the middle (with the info I need) is missing. I think this is due to the JavaScript on the page, but I'm not sure what the JavaScript is doing, so I don't know where to go from here...


Nah, it doesn't get those via AJAX, but the site does seem to block you if you don't send a valid browser user agent.
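
If the user agent is the deciding factor, the same idea could in principle be tried with file_get_contents() through a stream context. This is only a rough sketch and has not been checked against the site: `browser_context_options()` and `fetch_as_browser()` are hypothetical helper names, and the 20 second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: build HTTP stream-context options that send a
// browser-like User-Agent, for use with file_get_contents().
function browser_context_options($userAgent)
{
    return array('http' => array(
        'method'          => 'GET',
        'user_agent'      => $userAgent,
        'follow_location' => 1,   // follow redirects (PHP's default)
        'timeout'         => 20,  // seconds
    ));
}

// Hypothetical wrapper; returns the page body, or false on failure.
function fetch_as_browser($url, $userAgent)
{
    $context = stream_context_create(browser_context_options($userAgent));
    return file_get_contents($url, false, $context);
}

$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$options = browser_context_options($ua);
echo $options['http']['user_agent'], "\n";
```

Calling `fetch_as_browser()` with the roster URL and `$ua` would then make the request itself; nothing is fetched by the snippet as written.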

This worked for me to retrieve all the names (plus some minor treatment):

 

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);
   curl_close($ch);

   // each name sits in the page source as [<number>,'<name>',[
   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as &$name) {
      $name = ucwords($name);                          // capitalise each word
      $name = preg_replace('# {2,}#', ' ', $name);     // collapse repeated spaces
      $name = preg_replace('# (.) #', ' $1. ', $name); // add a dot after a lone initial
   }
   unset($name); // break the reference left over from the loop
   
   print_r($names[1]);
?>
</pre>
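
To see what the pattern and the cleanup steps do without hitting the site, here is a small sketch that runs them on a made-up string. The sample only assumes the page source contains entries shaped like `[number,'name',[`, and `extract_names()` is a hypothetical wrapper name.

```php
<?php
// Hypothetical helper wrapping the extraction and cleanup steps.
function extract_names($data)
{
    preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $matches);

    $names = array();
    foreach ($matches[1] as $name) {
        $name = ucwords($name);                          // capitalise each word
        $name = preg_replace('# {2,}#', ' ', $name);     // collapse repeated spaces
        $name = preg_replace('# (.) #', ' $1. ', $name); // dot after a lone initial
        $names[] = $name;
    }
    return $names;
}

// Made-up sample; the real page's markup is assumed to look roughly like this
$sample = "[12,'john  a smith',[0]],[13,'jane doe',[0]]";
print_r(extract_names($sample));
```

With this sample the result is "John A. Smith" and "Jane Doe". Note that ucwords() only raises the first letter of each word, so names that arrive in all capitals would stay that way.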


Change your code to this (some further cleanup of the list):

 

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);
   curl_close($ch);

   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as $n => &$name) {
      $name = ucwords($name);
      $name = preg_replace('# {2,}#', ' ', $name);
      $name = preg_replace('# (.) #', ' $1. ', $name);
      
      // drop entries that are not individual names
      if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false)
         unset($names[1][$n]);
   }
   unset($name); // break the reference left over from the loop
   
   // remove duplicates (the original keys are kept, so the indexes may have gaps)
   $names[1] = array_unique($names[1]);
   
   print_r($names[1]);
?>
</pre>
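
The extra cleanup can also be pulled out into a function of its own, which makes it easy to try on a list without fetching anything. A minimal sketch, assuming the names are already in a plain array: `filter_names()` is a hypothetical helper name and the sample list is made up. Unlike the code above, it reindexes the result with array_values().

```php
<?php
// Hypothetical helper: drop entries containing any unwanted word
// (case-insensitive), then remove duplicates and reindex from 0.
function filter_names(array $names, array $unwanted)
{
    $kept = array();
    foreach ($names as $name) {
        $skip = false;
        foreach ($unwanted as $word) {
            if (stripos($name, $word) !== false) {
                $skip = true;
                break;
            }
        }
        if (!$skip) {
            $kept[] = $name;
        }
    }
    return array_values(array_unique($kept));
}

// Made-up sample list
$sample = array('John A. Smith', 'Club Secretary', 'Jane Doe', 'John A. Smith', 'Match Results');
print_r(filter_names($sample, array('Secretary', 'Results')));
```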

