Scraping Search Results with cURL and PHP


stabnsprint


Hi there, I'm relatively new to PHP and was wondering if you guys could help me out.

 

I'm trying to write some PHP code that performs a search on Google given certain keywords and returns all of the links on the search result page. Right now, I'm using cURL to query the site and then DOM and XPath to parse the HTML and give me the links. Here is the code:


 

<?php

class scraper_google extends scraper_base
{
    public $dom;
    public $hrefs;

    public function init($keywords)
    {
        $this->keywords = $keywords;

        $this->target_url = 'http://www.google.com/#hl=en&q='
                                .$keywords[0].'&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';
        echo $this->target_url;
        $this->search_engine = 'www.google.com';
        $this->userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    }

    public function parse_results()
    {
        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        curl_setopt($ch, CURLOPT_URL, $this->target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        if (!$html)
        {
            echo "<br />cURL error number:" . curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }

        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        // grab all the links on the page
        $xpath = new DOMXPath($dom);
        $this->hrefs = $xpath->evaluate("/html//a");
    }

    public function display_results()
    {
        for ($i = 0; $i < $this->hrefs->length; $i++)
        {
            $href = $this->hrefs->item($i);
            $url = $href->getAttribute('href');
            echo "<br />Link stored: $url";
        }
    }
}

?>

 

 

 

And this is the script that implements it:

 

<?php

require_once('__root.inc.php');

$scraper = new scraper_google();
$scraper->keywords[0] = "keyword";
$scraper->init($scraper->keywords);
$scraper->parse_results();
$scraper->display_results();

?>
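To separate the parsing step from the network request, the DOM/XPath part of parse_results() can be exercised on a fixed HTML string. The following is only a minimal sketch: the sample markup and the helper name extract_links() are made up for illustration and are not part of the scraper class above. It also uses libxml_use_internal_errors() in place of the @ operator, which is a swapped-in choice rather than what the class does.

```php
<?php
// Hypothetical sample markup standing in for a fetched results page.
$html = '<html><body>'
      . '<a href="http://example.com/one">One</a>'
      . '<a href="/relative/two">Two</a>'
      . '<a name="no-href">Anchor without href</a>'
      . '</body></html>';

/**
 * Return the href value of every <a> element that has one.
 * extract_links() is an illustrative helper, not part of scraper_google.
 */
function extract_links(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // //a[@href] skips anchors that carry no href attribute.
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}

$links = extract_links($html);
foreach ($links as $url) {
    echo "Link stored: $url\n";
}
```

If this prints the two sample links, the parsing side is behaving as intended, and any missing links in the real run would more likely come from what the request returns than from the XPath query.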

 

 

Feel free to try it out yourself. The problem I'm having is that the request reaches the page, but the response contains only the header of the result page (the Google bar at the top, along with the image, video, and blog search links). I'm guessing this is because Google loads the search results via AJAX after the page loads. So my question is: is there any way to access and parse the page after the search results are displayed?
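One thing that may be worth ruling out first: in the target URL, everything after the # is a URL fragment, and fragments are handled by the browser rather than sent in the HTTP request. That would mean cURL only ever asks for the path /, with no keywords attached. The sketch below shows how parse_url() splits the URL, and one possible way to put the keywords in the query string instead. The helper build_search_url() and the /search?hl=...&q=... form are assumptions for illustration; the sketch does not check whether Google serves or permits results for a scripted request.

```php
<?php
// The URL shape built in init(), shortened for readability.
$original = 'http://www.google.com/#hl=en&q=keyword&aq=f';

// parse_url() reports the part after '#' as the fragment, not the query.
$parts = parse_url($original);
echo 'path: ', $parts['path'], "\n";
echo 'query: ', $parts['query'] ?? '(none)', "\n";
echo 'fragment: ', $parts['fragment'], "\n";

/**
 * Build a URL that carries the keywords in the query string.
 * build_search_url() and the /search path are assumptions, not a
 * documented interface.
 */
function build_search_url(array $keywords): string
{
    $params = [
        'hl' => 'en',
        'q'  => implode(' ', $keywords),
    ];
    // http_build_query() URL-encodes each value (spaces become '+').
    return 'http://www.google.com/search?' . http_build_query($params);
}

$url = build_search_url(['php', 'curl']);
echo $url, "\n";
```

Assigning a URL of this form to $this->target_url in init() would at least make the keywords part of the request that cURL sends.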

 

Thank you.

 

