rtadams89 Posted March 10, 2011

Ultimately, I'm trying to get a list of all the "Individual names" that show up on this page: https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1

To get the contents of that page (for later parsing with regex), I first tried file_get_contents(), but it seems to return the contents of an "Object Moved" error page. Looking at the source of the page I am trying to scrape, it looks like it uses JavaScript to submit a form post before the data is shown. I imagine this is why an error page is returned when I use file_get_contents().

For my second attempt, I used this function, which I found in the comments on the PHP.net page for file_get_contents():

```php
function http_post($url, $data)
{
    // Encode the POST fields as a URL-encoded query string.
    $data_url = http_build_query($data);
    $data_len = strlen($data_url);

    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Connection: close\r\nContent-Length: $data_len\r\n",
            'content' => $data_url,
        ),
    ));

    return array(
        'content' => file_get_contents($url, false, $context),
        // $http_response_header is populated by the file_get_contents() call above.
        'headers' => $http_response_header,
    );
}
```

It too returned the same error page. I would appreciate some help in getting the page data of the URL above into a PHP variable so that I can process it further.
rtadams89 (Author) Posted March 10, 2011

I got a bit further using this code:

```php
<?php
// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1");

// send a browser-like user agent
$userAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)";
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// skips certificate verification (insecure)
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

// $output contains the output string
$output = curl_exec($ch);

// close curl resource to free up system resources
curl_close($ch);

echo $output;
?>
```

I at least get some of the page, but a major chunk in the middle (with the info I need) is missing. I think this is due to the use of JavaScript on the page, but I'm not sure what the JavaScript is doing, so I don't know where to go from here...
silkfire Posted March 10, 2011

Nah, it doesn't get those via AJAX, but the site does seem to block you if you don't have a valid browser user agent. This worked for me to retrieve all the names (plus some minor cleanup):

```php
<pre>
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$data = curl_exec($ch);

// The names sit in the page source as entries like [123,'some name',[ ...
preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);

foreach ($names[1] as &$name) {
    $name = ucwords($name);                            // capitalize each word
    $name = preg_replace('# {2,}#', ' ', $name);       // collapse repeated spaces
    $name = preg_replace('# (.) #', ' $1. ', $name);   // add a period after a one-letter initial
}
unset($name); // break the reference left over from the loop

print_r($names[1]);
?>
</pre>
```
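If the user agent really is what the site checks, the same idea can be tried with file_get_contents() by setting the `user_agent` option of the HTTP stream context. This is only a sketch under that assumption and was not run against the site in this thread: the helper name `browser_like_context()` is made up for illustration, and the site may also depend on things this does not cover (cookies, for example).

```php
<?php
// Hypothetical helper: builds an HTTP stream context that sends a
// browser-like User-Agent and follows redirects.
function browser_like_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent, // sent as the User-Agent header
            'follow_location' => 1,          // follow Location redirects
            'max_redirects'   => 5,
            'timeout'         => 20,         // read timeout in seconds
        ],
    ]);
}

$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$context = browser_like_context($ua);

// Usage (needs network access, so it is left commented out here):
// $url  = 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1';
// $data = file_get_contents($url, false, $context);
```

Whether this is enough for that particular page is untested; cURL gives finer control if it is not.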
rtadams89 (Author) Posted March 10, 2011

Ahh, interesting. That does seem to help. Thanks for the help.
silkfire Posted March 10, 2011

I'm not sure, though, what solution you were aiming for. Did I guess correctly?
silkfire Posted March 10, 2011

Change your code to this (some further cleanup of the list):

```php
<pre>
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://membership.usarugby.org/PublicRosterRpt.aspx?ReportID=27343&PrinterFriendly=1');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$data = curl_exec($ch);

preg_match_all("#\[\d+,\'([^']+)\',\[#", $data, $names);

foreach ($names[1] as $n => &$name) {
    $name = ucwords($name);                            // capitalize each word
    $name = preg_replace('# {2,}#', ' ', $name);       // collapse repeated spaces
    $name = preg_replace('# (.) #', ' $1. ', $name);   // add a period after a one-letter initial

    // Drop entries that are not individual names.
    if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false) {
        unset($names[1][$n]);
    }
}
unset($name); // break the reference left over from the loop

// Remove duplicates (the original keys are preserved).
$names[1] = array_unique($names[1]);

print_r($names[1]);
?>
</pre>
```
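To check the cleanup steps without hitting the site, the same regexes can be run against a hard-coded string. This is a minimal sketch: the `$sample` string is invented to fit the pattern used above and is not copied from the real page, and `clean_names()` is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper around the extraction and cleanup steps from the post above.
function clean_names(string $data): array
{
    // Capture the quoted text in entries shaped like [123,'some name',[
    preg_match_all("#\[\d+,'([^']+)',\[#", $data, $matches);

    $names = [];
    foreach ($matches[1] as $raw) {
        $name = ucwords($raw);                             // capitalize each word
        $name = preg_replace('# {2,}#', ' ', $name);       // collapse repeated spaces
        $name = preg_replace('# (.) #', ' $1. ', $name);   // period after a one-letter initial

        // Skip entries that are not individual names.
        if (stripos($name, 'Secretary') !== false || stripos($name, 'Results') !== false) {
            continue;
        }
        $names[] = $name;
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($names));
}

// Invented sample data (assumption: the real page source uses this shape).
$sample = "[[101,'john  a smith',[1]],[102,'club secretary',[2]],"
        . "[103,'mary jones',[3]],[104,'john a smith',[4]]]";

print_r(clean_names($sample));
```

Note that ucwords() only uppercases the first letter of each word and does not lowercase the rest, so names that arrive in all caps would stay that way with this approach.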