PHP & cURL

PutterPlace · January 31, 2008

I have a new project that I would like to work on. It should be pretty simple. All I want to do is pull search results from Google. I would like to keep each result intact. According to my research, each search result shows up with the following code:

<div class=g>
  <!--m-->
  <h2 class=r>
    <a href="http://www.web-max.ca/PHP/" class=l onmousedown="return clk(this.href,'','','res','1','')">[TITLE]</a>
  </h2>

  <table border=0 cellpadding=0 cellspacing=0>
    <tr>
      <td class="j">
        <font size=-1>[DESCRIPTION]<br><span class=a>www.URL.com<b>PHP</b>/ - [size]k - </span><nobr><a class=fl href="http://64.233.167.104/search?q=cache:xxxxxxxxxxxx:www.URL.com+search+terms&hl=en&ct=clnk&cd=1&gl=us">Cached</a> - <a class=fl href="/search?hl=en&q=related:www.URL.com">Similar pages</a></nobr></font><!--n-->
      </td>
    </tr>
  </table>
</div>

I know how to use cURL to fetch the search results page. All I need to know now is how to parse the results. Basically, I need to extract everything between <div class=g> and </div>, for every occurrence on the page. Any help would be greatly appreciated.

effigy · January 31, 2008

What have you tried so far?

PutterPlace · February 4, 2008

This is what I've got right now:

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match_all('/StartTag(.*)EndTag/', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$content = str_replace("<div class=g>","StartTag",$content);

$content = str_replace("</table></div> ","</table>EndTag",$content);

$content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);

$output = ExtractText($content);

print $output[0];
?>

You can see its output here: http://www.xtremefilez.com/GoogleResults.php

The search term used for this test is "test term", as you can see from the script above. For some reason, the text "EndTagStartTag" remains at the end of each result. Is there any way that I can fix this?

Thanks in advance.
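
A note for readers following along: the leftover "EndTagStartTag" is consistent with greedy matching. (.*) grabs as much as it can, so when several results sit on one line, a single match runs from the first StartTag to the last EndTag and swallows every marker pair in between. A minimal sketch, using a made-up two-result string in place of the real page (the contents of $sample are an assumption for illustration):

```php
<?php
// Hypothetical stand-in for the page after the str_replace() calls above.
$sample = "StartTag<h2>first</h2>EndTagStartTag<h2>second</h2>EndTag";

// Greedy: one match spanning both results, inner markers included.
preg_match_all('/StartTag(.*)EndTag/', $sample, $greedy);

// Lazy: .*? stops at the nearest EndTag, giving one match per result.
// The s modifier lets . cross newlines in case a result spans lines.
preg_match_all('/StartTag(.*?)EndTag/s', $sample, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```

With the lazy version, each element of the result array is one search result, so no follow-up str_replace() is needed to strip the markers.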

PutterPlace · February 4, 2008

I just realized that I could do another str_replace to remove the EndTagStartTag. Now my code looks like this:

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match_all('/StartTag(.*)EndTag/', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$content = str_replace("<div class=g>","StartTag",$content);

$content = str_replace("</table></div> ","</table>EndTag",$content);

$content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);

$output = ExtractText($content);

$output = str_replace("EndTagStartTag","",$output[0]);

print $output;
?>

Now that I've gotten that out of the way, can anyone suggest how I could modify the code to keep only the title/link of each result? For example:

<h2 class=r><a href="http://www.nwea.org/support/details.aspx?content=901" class=l>How to change or close your <b>Test Term</b> Window</a></h2>

Is there a simple regular expression I could use to extract everything between each pair of <h2> and </h2> tags on the results page?

Thanks in advance for any help that you can provide.

effigy · February 4, 2008

preg_match_all('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);
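
To illustrate what this pattern does: (?<=...) and (?=...) are lookbehind and lookahead assertions, so the tags must be present but are not part of the match, and .*? keeps each match as short as possible. A small sketch against a made-up snippet (the contents of $string are an assumption, not real Google output):

```php
<?php
// Hypothetical markup shaped like the result titles quoted in the thread.
$string = '<h2 class=r><a href="http://example.com/a">First</a></h2>'
        . '<p>ignored</p>'
        . '<h2 class=r><a href="http://example.com/b">Second</a></h2>';

preg_match_all('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);

// $m[0] (whole match) and $m[1] (group 1) are identical here,
// because the lookarounds consume no characters.
print_r($m[1]);
```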

PutterPlace · February 10, 2008

Thanks effigy. You're a big help. I'm going to go try that right now, and then post back with my results.

PutterPlace · February 10, 2008

I tried out your suggestion, and came up with the following code:

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$output = ExtractText($content);

print_r($output);

?>

However, this doesn't seem to work for me. Is there something that I'm doing wrong?
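
One difference from effigy's suggestion stands out: the code calls preg_match() rather than preg_match_all(). preg_match() stops after the first match, so $m[1] is a single string instead of an array of every title. Whether that fully explains the failure on the live page is not clear from the thread; the sketch below only shows the difference on a made-up string:

```php
<?php
// Hypothetical input with two result titles.
$string  = '<h2 class=r>First</h2><h2 class=r>Second</h2>';
$pattern = '%(?<=<h2 class=r>)(.*?)(?=</h2>)%';

// preg_match(): first match only; $one[1] is a string.
preg_match($pattern, $string, $one);

// preg_match_all(): every match; $all[1] is an array of strings.
preg_match_all($pattern, $string, $all);

var_dump($one[1]);
print_r($all[1]);
```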

PutterPlace · February 10, 2008

Never mind. I simplified the regular expression and came up with the following:

preg_match_all('/<h2 class=r>(.*?)<\/h2>/', $string, $m);

My code now works perfectly. Thanks for your help, effigy.
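
A side note on get_content(): the output buffering is there because curl_exec() prints the response by default. Setting CURLOPT_RETURNTRANSFER makes curl_exec() return the body as a string instead, which removes the need for ob_start(). A hedged sketch of that variant; the fallback to an empty string on failure is an addition that is not in the original code:

```php
<?php
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Return the response body from curl_exec() instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $string = curl_exec($ch);
    // Assumption: treat a failed request as empty content.
    if ($string === false) {
        $string = '';
    }
    curl_close($ch);
    return $string;
}
```

It is called the same way as before, for example get_content($url) followed by the preg_match_all() line above.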
