Scraping of web page not working

rtadams89 · March 10, 2011

Ultimately, I'm trying to get a list of all the "Individual names" that show up on this page: https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1

To get the contents of that page (for later parsing with regex), I first tried file_get_contents(), but it seems to return the contents of an "Object Moved" error page. Looking at the source of the page I am trying to scrape, it appears to use JavaScript to submit a form POST before the data is shown. I imagine this is why an error page is returned when I use file_get_contents().

For my second attempt, I used this function, which I found on the PHP.net comments page for file_get_contents():

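function http_post($url, $data)
{
    $data_url = http_build_query($data);
    $data_len = strlen($data_url);

    // Send the form-encoded body as a POST request via the http stream wrapper.
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Connection: close\r\nContent-Type: application/x-www-form-urlencoded\r\nContent-Length: $data_len\r\n",
        'content' => $data_url,
    )));
    $content = file_get_contents($url, false, $context);

    // $http_response_header is populated in this scope by the http wrapper.
    return array('content' => $content, 'headers' => $http_response_header);
}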

It too returned the same error page.

I would appreciate some help in getting the page data of the URL above into a PHP variable so that I can process it further.
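A later reply in this thread suggests the site rejects requests that lack a browser-like user agent. A minimal sketch of how file_get_contents() could send one through a stream context follows. The helper name build_browser_context() and its defaults are assumptions for illustration, and whether this alone is enough for this particular site is untested; https URLs also need the openssl extension.

```php
<?php
// Hypothetical helper (not from the thread): builds a stream context that
// makes file_get_contents() send a browser-like User-Agent header.
function build_browser_context(string $userAgent, int $timeout = 20)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent, // sent as the User-Agent header
            'follow_location' => 1,          // follow redirects such as "Object Moved"
            'max_redirects'   => 5,
            'timeout'         => $timeout,   // read timeout in seconds
            'ignore_errors'   => true,       // return the body even on 4xx/5xx
        ],
    ]);
}

// Usage (performs a network request, so it is left commented out):
// $ua   = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
// $html = file_get_contents($url, false, build_browser_context($ua));
```

With ignore_errors enabled the body of an error page is still returned, so the response headers in $http_response_header are worth checking before parsing.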

rtadams89 · March 10, 2011

I got a bit further using this code:

<?php

    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1");

    // send a browser-like user agent
    $userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)";
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

    // fail on HTTP errors, follow redirects, and give up after 20 seconds
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // return the transfer as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // skip SSL certificate verification (insecure)
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    // $output contains the output string, or false on failure
    $output = curl_exec($ch);

    // close curl resource to free up system resources
    curl_close($ch);

    echo $output;

?>

I at least get some of the page, but a major chunk in the middle (with the info I need) is missing. I think this is due to the use of JavaScript on the page, but I'm not sure what the JavaScript is doing, so I don't know where to go from here...

silkfire · March 10, 2011

Nah, it doesn't get those via AJAX, but the site does seem to block you if you don't send a valid browser user agent.

This worked for me to retrieve all the names (plus some minor treatment):

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);

   // Match entries of the form [123,'name',[ and capture the name
   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as &$name) {
      $name = ucwords($name);                           // capitalize each word
      $name = preg_replace('# {2,}#', ' ', $name);      // collapse repeated spaces
      $name = preg_replace('# (.) #', ' $1. ', $name);  // add a period after a single-letter initial
   }
   unset($name);   // break the reference left over from the loop
   
   print_r($names[1]);
?>
</pre>
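To try the extraction and cleanup steps without hitting the site, here is a rough sketch that wraps them in a function and runs them on a made-up string. The function name extract_names() and the sample input are assumptions for illustration; the sample is not copied from the real page and only mimics the shape the regex expects.

```php
<?php
// Hypothetical helper: pulls names out of page text and tidies them up,
// using the same three cleanup steps as the code above.
function extract_names(string $html): array
{
    // Capture the quoted text in entries shaped like [123,'some name',[
    preg_match_all("#\[\d+,'([^']+)',\[#", $html, $matches);

    $names = [];
    foreach ($matches[1] as $raw) {
        $name = ucwords($raw);                            // capitalize each word
        $name = preg_replace('# {2,}#', ' ', $name);      // collapse repeated spaces
        $name = preg_replace('# (.) #', ' $1. ', $name);  // period after a single-letter initial
        $names[] = $name;
    }
    return $names;
}

// Invented sample input, not taken from the real page.
$sample = "[1,'john  a smith',[0]],[2,'jane doe',[0]]";
print_r(extract_names($sample));
```

For this sample the result contains "John A. Smith" and "Jane Doe". Building a new array instead of looping by reference avoids leaving a dangling reference behind.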

rtadams89 · March 10, 2011

Ahh, interesting. That does seem to help. Thanks for the help.

silkfire · March 10, 2011

I'm not sure what solution you were aiming for, though. Did I guess correctly?

silkfire · March 10, 2011

Change your code to this (some further cleanup of the list):

<pre>
<?php
   $ch = curl_init();

   curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   
   $data = curl_exec($ch);

   // Match entries of the form [123,'name',[ and capture the name
   preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);
   
   foreach($names[1] as $n => &$name) {
      $name = ucwords($name);
      $name = preg_replace('# {2,}#', ' ', $name);
      $name = preg_replace('# (.) #', ' $1. ', $name);
      
      // Drop entries that are not individual names
      if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false)
         unset($names[1][$n]);
   }
   unset($name);   // break the reference left over from the loop
   
   // Remove duplicate names
   $names[1] = array_unique($names[1]);
   
   print_r($names[1]);
?>
</pre>
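The loop above removes entries from the array while iterating over it by reference. As an alternative sketch, the same filtering and de-duplication can be done with array_filter() and array_unique() on an already cleaned list. The function name clean_name_list() and the sample values are assumptions for illustration; the blocked words are the two used above.

```php
<?php
// Hypothetical helper: drops entries containing any blocked word
// (case-insensitive), removes duplicates, and reindexes from 0.
function clean_name_list(array $names, array $blocked = ['Secretary', 'Results']): array
{
    $kept = array_filter($names, function ($name) use ($blocked) {
        foreach ($blocked as $word) {
            if (stripos($name, $word) !== false) {
                return false; // not an individual's name
            }
        }
        return true;
    });

    // array_unique() keeps the first occurrence; array_values() closes key gaps.
    return array_values(array_unique($kept));
}

// Invented sample values, not taken from the real page.
$list = ['John Smith', 'Club Secretary', 'John Smith', 'Match Results', 'Jane Doe'];
print_r(clean_name_list($list));
```

Reindexing with array_values() is optional; it just avoids the gaps in the keys that unset() and array_unique() leave behind.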
