
Scraping of web page not working


rtadams89


Ultimately, I'm trying to get a list of all the "Individual names" that show up on this page: https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1

 

To get the contents of that page (for later parsing with regex), I first tried file_get_contents(), but it seems to return the contents of an "Object Moved" error page. Looking at the source of the page I am trying to scrape, it appears to use JavaScript to submit a form POST before the data is shown. I imagine this is why an error page is returned when I use file_get_contents().

 

For my second attempt, I used this function which I found on the PHP.net comments page for file_get_contents().

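function http_post ($url, $data)
{
    // Encode the fields as an application/x-www-form-urlencoded body.
    $data_url = http_build_query ($data);
    $data_len = strlen ($data_url);

    $context = stream_context_create (array ('http'=>array (
        'method'=>'POST',
        'header'=>"Connection: close\r\n"
            . "Content-Type: application/x-www-form-urlencoded\r\n"
            . "Content-Length: $data_len\r\n",
        'content'=>$data_url
    )));

    $content = file_get_contents ($url, false, $context);

    // PHP fills $http_response_header in the local scope after the request.
    return array ('content'=>$content, 'headers'=>$http_response_header);
}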

It too returned the same error page.
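
An "Object Moved" body usually accompanies a 3xx redirect, so one way to see what is happening is to look at the status lines and Location headers that came back. The following is only a minimal sketch: summarize_response() is a hypothetical helper (not part of PHP), and the sample header lines are made up for illustration. It assumes the input is an array of raw header lines in the shape of $http_response_header, which may hold the headers of several responses when redirects are followed.

```php
<?php
// Hypothetical helper: group raw header lines into one entry per response,
// recording the HTTP status code and any Location header.
function summarize_response(array $headerLines): array
{
    $hops = [];
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A status line starts a new response.
            $hops[] = ['status' => (int) $m[1], 'location' => null];
        } elseif ($hops && stripos($line, 'Location:') === 0) {
            // Attach the redirect target to the most recent response.
            $hops[count($hops) - 1]['location'] = trim(substr($line, 9));
        }
    }
    return $hops;
}

// Made-up sample headers, not captured from the real site.
$sample = [
    'HTTP/1.1 302 Found',
    'Location: /Login.aspx',
    'Content-Type: text/html',
];
print_r(summarize_response($sample));
```

With the http_post() function above, the 'headers' element of its return value could be passed to this helper to check whether the server is redirecting and where to.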

 

I would appreciate some help in getting the page data of the URL above into a PHP variable so that I can process it further.


I got a bit further using this code:

<?php

// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1");

// send a browser-like user agent
$userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

// fail on HTTP status codes >= 400, follow redirects, give up after 20 seconds
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// return the transfer as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// skip certificate verification (insecure; only acceptable for testing)
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

// $output contains the response body, or false on failure
$output = curl_exec($ch);

// close curl resource to free up system resources
curl_close($ch);

echo $output;

?>

 

I at least get some of the page, but a major chunk in the middle (with the info I need) is missing. I think this is due to the use of JavaScript on the page, but I'm not sure what the JavaScript is doing, so I don't know where to go from here...

Nah, it doesn't get those via AJAX, but the site does seem to block you if you don't send a valid browser user agent.

This worked for me to retrieve all the names (plus some minor treatment):

 

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);
   curl_close($ch);

   // matches entries of the form [<digits>,'<name>',[ and captures the name
   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as &$name) {
      $name = ucwords($name);                           // capitalise each word
      $name = preg_replace('# {2,}#', ' ', $name);      // collapse repeated spaces
      $name = preg_replace('# (.) #', ' $1. ', $name);  // add a period after a single-letter initial
   }
   unset($name); // break the reference left over from the loop
   
   print_r($names[1]);
?>
</pre>
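
As an aside on the user-agent point: the same idea can be tried with file_get_contents() by passing a stream context, which is what the original attempt used. This is only a sketch, and whether it is enough for this particular site is an assumption that has not been tested here. browser_context() is a hypothetical helper name; the 'http' context options it sets (user_agent, follow_location, max_redirects, timeout) are standard PHP ones. Fetching an https:// URL this way also needs the openssl extension.

```php
<?php
// Hypothetical helper: build a stream context that sends a browser-like
// User-Agent and follows a limited number of redirects.
function browser_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'follow_location' => 1,
            'max_redirects'   => 5,
            'timeout'         => 20,
        ],
    ]);
}

$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$context = browser_context($ua);

// The fetch itself is left commented out so the sketch runs without network access:
// $html = file_get_contents($url, false, $context);
```

The cURL version is probably the safer choice when redirects and error handling matter, since it reports failures more explicitly.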

Change your code to this (some further cleanup of the list):

 

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);
   curl_close($ch);

   // matches entries of the form [<digits>,'<name>',[ and captures the name
   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as $n => &$name) {
      $name = ucwords($name);
      $name = preg_replace('# {2,}#', ' ', $name);
      $name = preg_replace('# (.) #', ' $1. ', $name);
      
      // drop entries that are not individual names
      if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false)
         unset($names[1][$n]);
   }
   unset($name); // break the reference left over from the loop
   
   // remove duplicates (original keys are preserved)
   $names[1] = array_unique($names[1]);
   
   print_r($names[1]);
?>
</pre>
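
To try the cleanup steps without hitting the site, the extraction can be pulled into a function and run against a string. This is a rough sketch under assumptions: extract_names() is a hypothetical name, and the sample string is made up to fit the regex above, not copied from the real page. It also reindexes the result with array_values(), which the code above does not do.

```php
<?php
// Hypothetical helper: pull names out of a page body and tidy them,
// following the same steps as the cURL script above.
function extract_names(string $html): array
{
    // Capture the quoted name in entries of the form [<digits>,'<name>',[
    preg_match_all("#\[\d+,'([^']+)',\[#", $html, $matches);

    $names = [];
    foreach ($matches[1] as $raw) {
        $name = ucwords($raw);                            // capitalise each word
        $name = preg_replace('# {2,}#', ' ', $name);      // collapse repeated spaces
        $name = preg_replace('# (.) #', ' $1. ', $name);  // period after a single-letter initial

        // Skip entries that are not individual names.
        if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false) {
            continue;
        }
        $names[] = $name;
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($names));
}

// Made-up sample input shaped to fit the regex.
$sample = "[1,'john  a smith',[0]],[2,'jane doe',[0]],[3,'club secretary',[0]],[4,'jane doe',[0]]";
print_r(extract_names($sample));
```

Building a new array instead of iterating by reference avoids the leftover-reference pitfall and keeps the result keys contiguous.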
