Scraping Search Results with cURL and PHP

stabnsprint · July 10, 2009

Hi there, I'm relatively new to PHP and was wondering if you guys could help me out.

I'm trying to write some PHP code that performs a search on Google given certain keywords and returns all of the links on the search result page. Right now, I'm using cURL to query the site and then DOM and XPath to parse the HTML and give me the links. Here is the code:


<?php

class scraper_google extends scraper_base
{
    public $dom;
    public $hrefs;

    public function init($keywords)
    {
        $this->keywords = $keywords;

        $this->target_url = 'http://www.google.com/#hl=en&q='
            .$keywords[0].'&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';
        echo $this->target_url;
        $this->search_engine = 'www.google.com';
        $this->userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    }

    public function parse_results()
    {
        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        curl_setopt($ch, CURLOPT_URL, $this->target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        if (!$html)
        {
            echo "<br />cURL error number:" . curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }

        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        // grab all the links on the page
        $xpath = new DOMXPath($dom);
        $this->hrefs = $xpath->evaluate("/html//a");
    }

    public function display_results()
    {
        for ($i = 0; $i < $this->hrefs->length; $i++)
        {
            $href = $this->hrefs->item($i);
            $url = $href->getAttribute('href');
            echo "<br />Link stored: $url";
        }
    }
}

?>

And this is the script that implements it:

<?php

require_once('__root.inc.php');

$scraper = new scraper_google();
$scraper->keywords[0] = "keyword";
$scraper->init($scraper->keywords);
$scraper->parse_results();
$scraper->display_results();

?>
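
The DOM and XPath step can be checked on its own, without any network request, by running it against a hard-coded HTML string. The following is only a minimal sketch: `extract_hrefs()` and the sample markup are hypothetical names made up for illustration, not part of the class above.

```php
<?php
// Hypothetical helper (not part of the original class):
// returns the href of every <a> element that has one.
function extract_hrefs(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}

// Made-up sample markup: two real links and one anchor without an href.
$sample = '<html><body><a href="http://example.com/one">One</a>'
        . '<p><a href="/two">Two</a></p><a name="no-href">Skip</a></body></html>';

foreach (extract_hrefs($sample) as $url) {
    echo "Link stored: $url\n";
}
```

If this prints both links, the parsing half is probably fine and the issue lies in what the request actually returns.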

Feel free to try it out yourself. The problem I'm having is that the request reaches the page but only returns the header of the results page (the Google bar at the top, along with the image, video, and blog search links). My guess is that Google loads the search results via AJAX after the page loads. So my question is: is there any way to access and parse the page after the search results are displayed?
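
One thing that may be worth checking here: in the target URL, the search terms come after a `#`, which makes them part of the URL fragment. Fragments are handled by the browser and are not sent to the server in the HTTP request, so cURL would effectively be asking for the bare home page. The sketch below uses `parse_url()` to show how the URL splits, and builds an alternative with the terms in the query string. `build_search_url()` is a hypothetical helper, and the `/search` path with `hl` and `q` parameters is an assumption rather than something confirmed in this thread.

```php
<?php
// Hypothetical helper (not part of the original class): builds a results URL
// whose search terms sit in the query string rather than in the fragment.
function build_search_url(string $keyword): string
{
    // Assumption: the plain /search path accepts hl and q as query parameters.
    return 'http://www.google.com/search?' . http_build_query([
        'hl' => 'en',
        'q'  => $keyword,
    ]);
}

// The URL built in init(): everything after '#' is the fragment.
$original = 'http://www.google.com/#hl=en&q=keyword&aq=f&oq=&aqi=g10&fp=ADrf44LAAa8';
$parts = parse_url($original);

echo 'path:     ', $parts['path'], "\n";
echo 'query:    ', $parts['query'] ?? '(none)', "\n";
echo 'fragment: ', $parts['fragment'], "\n";

// http_build_query() also URL-encodes the keyword.
echo build_search_url('php curl'), "\n";
```

Whether the server returns a parseable results page for that URL, and whether automated queries are permitted at all, would still need to be verified separately.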

Thank you.
