PHP & cURL


PutterPlace

I have a new project that I would like to work on. It should be pretty simple: all I want to do is pull search results from Google and keep each result intact. According to my research, each search result is rendered with the following markup:

 

<div class=g>
  <!--m-->
  <h2 class=r>
    <a href="http://www.web-max.ca/PHP/" class=l onmousedown="return clk(this.href,'','','res','1','')">[TITLE]</a>
  </h2>

  <table border=0 cellpadding=0 cellspacing=0>
    <tr>
      <td class="j">
        <font size=-1>[DESCRIPTION]<br><span class=a>www.URL.com<b>PHP</b>/ - [size]k - </span><nobr><a class=fl href="http://64.233.167.104/search?q=cache:xxxxxxxxxxxx:www.URL.com+search+terms&hl=en&ct=clnk&cd=1&gl=us">Cached</a> - <a class=fl href="/search?hl=en&q=related:www.URL.com">Similar pages</a></nobr></font><!--n-->
      </td>
    </tr>
  </table>
</div>

 

I know how to use cURL to fetch the contents of the search results page. What I need to know now is how to parse the results. Basically, I need to extract everything between <div class=g> and the matching </div>, for every occurrence on the page. Any help would be greatly appreciated. :)
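
One way to pull out every such block is to parse the page as HTML instead of searching for string markers. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes PHP's DOM extension is available, and `extract_result_blocks()` is a hypothetical helper name that does not appear in the thread. Because the blocks are re-serialized by the parser, attribute quoting may differ slightly from the original source.

```php
<?php
// Sketch: collect every <div class="g"> element from an HTML string.
// extract_result_blocks() is a hypothetical helper name (an assumption).
function extract_result_blocks($html)
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return array();
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match divs whose class list contains the token "g", so variants such
    // as <div class=g style="..."> are included as well.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " g ")]'
    );

    $blocks = array();
    foreach ($nodes as $node) {
        // saveHTML($node) serializes just this element and its children.
        $blocks[] = $doc->saveHTML($node);
    }
    return $blocks;
}
```

Passing the string returned by the cURL request to `extract_result_blocks()` would give an array with one HTML fragment per result. Unlike a string search, this approach still finds the correct closing </div> when a result contains nested divs.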


This is what I've got right now:

 

 

 

 

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match_all('/StartTag(.*)EndTag/', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$content = str_replace("<div class=g>","StartTag",$content);

$content = str_replace("</table></div> ","</table>EndTag",$content);

$content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);

$output = ExtractText($content);

print $output[0];
?>

 

You can see its output here: http://www.xtremefilez.com/GoogleResults.php

 

The search term used for this test is "test term", as you can see from the script above. For some reason, the text "EndTagStartTag" remains at the end of each result. Is there any way that I can fix this?

 

Thanks in advance.
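
A likely cause of the leftover "EndTagStartTag" text is that `(.*)` is greedy: it runs from the first StartTag to the last EndTag it can reach, so the whole results list comes back as a single match with the markers still inside it. A minimal sketch of one fix is below. It matches each block directly with a lazy quantifier and skips the placeholder replacement. The function name `extract_results()` is hypothetical, and the pattern assumes every block ends with </table></div>, as in the markup quoted earlier in the thread.

```php
<?php
// Sketch: match each result block directly, without StartTag/EndTag markers.
// extract_results() is a hypothetical name (an assumption, not thread code).
function extract_results($html)
{
    // (?:\s[^>]*)? allows extra attributes such as style="margin-left:2.5em;".
    // .*? is lazy: it stops at the first closing sequence, not the last.
    // The s modifier lets . match newlines if a block spans several lines.
    $pattern = '%<div class=g(?:\s[^>]*)?>(.*?</table>)\s*</div>%s';
    preg_match_all($pattern, $html, $m);

    // $m[1] holds the inner markup of each block, one array entry per result.
    return $m[1];
}
```

Looping over the returned array with `foreach` and printing each entry would output the results one at a time, with nothing left over between them.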


I just realized that I could do another str_replace to remove the EndTagStartTag. Now my code looks like this:

 

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match_all('/StartTag(.*)EndTag/', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$content = str_replace("<div class=g>","StartTag",$content);

$content = str_replace("</table></div> ","</table>EndTag",$content);

$content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);

$output = ExtractText($content);

$output = str_replace("EndTagStartTag","",$output[0]);

print $output;
?>

 

Now that I've gotten that out of the way, can anyone suggest how I could modify the code to keep only the title/link of each result? For example:

 

<h2 class=r><a href="http://www.nwea.org/support/details.aspx?content=901" class=l>How to change or close your <b>Test Term</b> Window</a></h2>

 

Is there a simple regular expression I could use to extract everything between each pair of <h2> and </h2> tags on the results page?

 

Thanks in advance for any help that you can provide.


I tried out your suggestion, and came up with the following code:

 

<?php
function get_content($url)
{
   $ch = curl_init();
   curl_setopt ($ch, CURLOPT_URL, $url);
   curl_setopt ($ch, CURLOPT_HEADER, 0);
   ob_start();
   curl_exec ($ch);
   curl_close ($ch);
   $string = ob_get_contents();
   ob_end_clean();
   return $string;
}

function ExtractText($string) {
    preg_match('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);
    return $m[1];
}

$content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");

$output = ExtractText($content);

print_r($output);

?>

 

However, this doesn't seem to work for me. Is there something that I'm doing wrong?
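
Two things in the code above may explain the result. First, `preg_match()` stops after the first match, so at most one heading is ever returned; `preg_match_all()` collects all of them. Second, without the `s` modifier, `.` does not match newlines, so a heading that spans several lines is skipped. A minimal sketch that addresses both is below. The function name `extract_titles()` is hypothetical, and the pattern assumes each heading has the `<h2 class=r><a href="...">` shape quoted earlier in the thread. The markup Google returns to a script may also differ from what a browser receives, which this sketch cannot account for.

```php
<?php
// Sketch: collect the link and title from every <h2 class=r> heading.
// extract_titles() is a hypothetical name (an assumption, not thread code).
function extract_titles($html)
{
    // s: let . span newlines; i: tolerate upper-case tag names.
    $pattern = '%<h2 class=r>\s*<a\s+href="([^"]*)"[^>]*>(.*?)</a>\s*</h2>%si';

    // PREG_SET_ORDER groups the captures per match: $match[1] is the href,
    // $match[2] is the link text.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $results = array();
    foreach ($matches as $match) {
        $results[] = array(
            // Attribute values are entity-encoded in HTML source (&amp; etc.).
            'url'   => html_entity_decode($match[1]),
            // strip_tags() removes the <b> highlighting around search terms.
            'title' => strip_tags($match[2]),
        );
    }
    return $results;
}
```

Calling `print_r(extract_titles($content))` would then list one url/title pair per result, or an empty array if the page contains no heading of that shape.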
