PutterPlace Posted January 31, 2008

I have a new project that I would like to work on. It should be pretty simple: all I want to do is pull search results from Google, keeping each result intact. According to my research, each search result shows up with the following code:

    <div class=g>
    <!--m-->
    <h2 class=r>
    <a href="http://www.web-max.ca/PHP/" class=l onmousedown="return clk(this.href,'','','res','1','')">[TITLE]</a>
    </h2>
    <table border=0 cellpadding=0 cellspacing=0>
    <tr>
    <td class="j">
    <font size=-1>[DESCRIPTION]<br><span class=a>www.URL.com<b>PHP</b>/ - [size]k - </span><nobr><a class=fl href="http://64.233.167.104/search?q=cache:xxxxxxxxxxxx:www.URL.com+search+terms&hl=en&ct=clnk&cd=1&gl=us">Cached</a> - <a class=fl href="/search?hl=en&q=related:www.URL.com">Similar pages</a></nobr></font><!--n-->
    </td>
    </tr>
    </table>
    </div>

I know how to use cURL to get the contents of the search results page. All I need to know now is how to parse the results: I need to extract everything between <div class=g> and </div>, for every occurrence on the page. Any help would be greatly appreciated.
effigy Posted January 31, 2008

What have you tried so far?
PutterPlace Posted February 4, 2008

This is what I've got right now:

    <?php
    // Fetch a URL with cURL, capturing the printed response via output buffering.
    function get_content($url)
    {
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_HEADER, 0);
        ob_start();
        curl_exec ($ch);
        curl_close ($ch);
        $string = ob_get_contents();
        ob_end_clean();
        return $string;
    }

    // Return everything found between the StartTag and EndTag markers.
    function ExtractText($string)
    {
        preg_match_all('/StartTag(.*)EndTag/', $string, $m);
        return $m[1];
    }

    $content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");
    // Swap the result wrappers for plain-text markers before matching.
    $content = str_replace("<div class=g>","StartTag",$content);
    $content = str_replace("</table></div> ","</table>EndTag",$content);
    $content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);
    $output = ExtractText($content);
    print $output[0];
    ?>

You can see its output here: http://www.xtremefilez.com/GoogleResults.php

The search term used for this test is "test term", as you can see from the script above. For some reason, the text "EndTagStartTag" remains at the end of each result. Is there any way that I can fix this? Thanks in advance.
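
The leftover "EndTagStartTag" text is consistent with greedy matching: (.*) extends to the last "EndTag" it can reach, so neighbouring results are captured as one match, markers included. A minimal sketch, assuming a made-up two-result string ($content below is hypothetical sample data, not real Google output), compares the greedy pattern with the lazy form (.*?):

```php
<?php
// Hypothetical sample: two results already rewritten with the StartTag/EndTag markers.
$content = 'StartTag<a href="/one">One</a>EndTagStartTag<a href="/two">Two</a>EndTag';

// Greedy: (.*) runs on to the LAST EndTag, so both results land in a single match.
preg_match_all('/StartTag(.*)EndTag/', $content, $greedy);

// Lazy: (.*?) stops at the FIRST EndTag; the s modifier also lets . match newlines.
preg_match_all('/StartTag(.*?)EndTag/s', $content, $lazy);

print_r($greedy[1]); // one element that still contains "EndTagStartTag"
print_r($lazy[1]);   // two elements, one per result
```

With the lazy quantifier each result comes back as its own array element, so no follow-up clean-up of the markers should be needed.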
PutterPlace Posted February 4, 2008

I just realized that I could do another str_replace to remove the "EndTagStartTag". Now my code looks like this:

    <?php
    // Fetch a URL with cURL, capturing the printed response via output buffering.
    function get_content($url)
    {
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_HEADER, 0);
        ob_start();
        curl_exec ($ch);
        curl_close ($ch);
        $string = ob_get_contents();
        ob_end_clean();
        return $string;
    }

    // Return everything found between the StartTag and EndTag markers.
    function ExtractText($string)
    {
        preg_match_all('/StartTag(.*)EndTag/', $string, $m);
        return $m[1];
    }

    $content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");
    $content = str_replace("<div class=g>","StartTag",$content);
    $content = str_replace("</table></div> ","</table>EndTag",$content);
    $content = str_replace("<div class=g style=\"margin-left:2.5em;\">","StartTag",$content);
    $output = ExtractText($content);
    // Strip the markers left between adjacent results.
    $output = str_replace("EndTagStartTag","",$output[0]);
    print $output;
    ?>

Now that I've gotten that out of the way, can anyone suggest how I could modify the code to keep only the title/link of each result? For example:

    <h2 class=r><a href="http://www.nwea.org/support/details.aspx?content=901" class=l>How to change or close your <b>Test Term</b> Window</a></h2>

Is there a simple regular expression I could use to extract everything between all of the <h2> and </h2> tags on the results page? Thanks in advance for any help that you can provide.
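
As a side note on get_content(): the output buffering can be avoided, because with CURLOPT_RETURNTRANSFER set, curl_exec() returns the response body as a string instead of printing it. A sketch under that assumption; the name fetch_html and the 10-second timeout are illustrative choices, not part of the script above:

```php
<?php
// Hypothetical replacement for get_content(); needs the cURL extension when called.
function fetch_html($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);        // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // illustrative timeout, in seconds
    $html = curl_exec($ch);                         // string on success, false on failure
    return $html === false ? '' : $html;            // fall back to an empty string on failure
}
```

It would be called the same way as the original, for example $content = fetch_html($url); and returning an empty string on failure keeps the later string functions from receiving false.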
effigy Posted February 4, 2008

    preg_match_all('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);
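
In this pattern the lookbehind (?<=...) and lookahead (?=...) only assert that the tags are present without consuming them, so the whole match and capture group 1 are the same text, and $m[0] and $m[1] end up identical. A small sketch on an invented two-heading string (the sample HTML is an assumption for illustration):

```php
<?php
// Hypothetical sample markup with two result headings.
$string = '<h2 class=r><a href="/a">First</a></h2><p>x</p><h2 class=r><a href="/b">Second</a></h2>';

// Lookarounds anchor the match between the tags but leave the tags out of it.
preg_match_all('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);

print_r($m[1]); // the markup inside each <h2 class=r> ... </h2> pair
```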
PutterPlace Posted February 10, 2008

Thanks effigy. You're a big help. I'm going to go try that right now, and then post back with my results.
PutterPlace Posted February 10, 2008

I tried out your suggestion and came up with the following code:

    <?php
    // Fetch a URL with cURL, capturing the printed response via output buffering.
    function get_content($url)
    {
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_HEADER, 0);
        ob_start();
        curl_exec ($ch);
        curl_close ($ch);
        $string = ob_get_contents();
        ob_end_clean();
        return $string;
    }

    function ExtractText($string)
    {
        preg_match('%(?<=<h2 class=r>)(.*?)(?=</h2>)%', $string, $m);
        return $m[1];
    }

    $content = get_content("http://www.google.com/search?hl=en&q=test+term&btnG=Google+Search");
    $output = ExtractText($content);
    print_r($output);
    ?>

However, this doesn't seem to work for me. Is there something that I'm doing wrong?
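
One possible explanation, assuming the page itself was fetched correctly: the script calls preg_match() where the suggestion used preg_match_all(). preg_match() stops after the first match, so $m[1] is a single string, whereas preg_match_all() fills $m[1] with an array of every match. A sketch on invented sample markup:

```php
<?php
// Hypothetical sample markup with two headings.
$string  = '<h2 class=r>One</h2><h2 class=r>Two</h2>';
$pattern = '%(?<=<h2 class=r>)(.*?)(?=</h2>)%';

preg_match($pattern, $string, $first);   // first match only
preg_match_all($pattern, $string, $all); // every match

var_dump($first[1]); // a single string
print_r($all[1]);    // an array with one entry per heading
```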
PutterPlace Posted February 10, 2008

Never mind... I simplified the regular expression and came up with the following:

    preg_match_all('/<h2 class=r>(.*?)<\/h2>/', $string, $m);

My code now works perfectly. Thanks for your help, effigy.
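
Building on that pattern, the link and the title can be pulled apart in the same pass. The following is only a sketch: it assumes the 2008-era markup quoted earlier in the thread (an <a href="..."> directly inside <h2 class=r>), and the helper name extract_results is made up for illustration.

```php
<?php
// Hypothetical helper: returns a list of array('url' => ..., 'title' => ...) entries.
function extract_results($html)
{
    // Group 1 is the href value, group 2 is the link text (which may contain <b> tags).
    $pattern = '/<h2 class=r>\s*<a href="([^"]*)"[^>]*>(.*?)<\/a>\s*<\/h2>/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $results = array();
    foreach ($matches as $match) {
        $results[] = array(
            'url'   => html_entity_decode($match[1]), // turn &amp; back into &
            'title' => strip_tags($match[2]),         // drop the <b> highlighting
        );
    }
    return $results;
}

// Sample heading taken from the example earlier in the thread.
$sample = '<h2 class=r><a href="http://www.nwea.org/support/details.aspx?content=901" class=l>How to change or close your <b>Test Term</b> Window</a></h2>';
print_r(extract_results($sample));
```

PREG_SET_ORDER groups the captures per result, which keeps each URL next to its title. If the markup differs from the sample, the pattern would need adjusting.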