jaxdevil Posted January 22, 2009

I am trying an example scraper project from a book I bought (PHP Hacks), but it is not working. I typed the code exactly as it appears in the book. I am sure the problem is in these two lines, in the expressions I changed (it did NOT work as it was written). I have echoed out the $str data; it is grabbing the page's contents, and I do see the markup I want to scrape.

Here is the preg_match_all code:

// Get just the list sorted by name
preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

// Get each of the movie entries
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/s', $byname[0], $moviedata );

and here is the original way the book had it:

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

and here is the markup as it is in the $str variable (or at least the portion I am scraping):

<div id="sortbyname1">
<p class="listing">
<SPAN CLASS="noscore">xx</SPAN> <A HREF="/video/titles/alphabetkiller">Alphabet Killer [51] => The</A><BR>
<SPAN CLASS="red">20</SPAN> <A HREF="/video/titles/americancarol">American Carol [52] => An</A><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/anamorph">Anamorph</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/appaloosa">Appaloosa</A><BR>
<SPAN CLASS="red">26</SPAN> <A HREF="/video/titles/babylonad">Babylon A.D.</A><BR>
<SPAN CLASS="red">24</SPAN> <A HREF="/video/titles/bangkokdangerous2008">Bangkok Dangerous</A><BR>
<SPAN CLASS="yellow">60</SPAN> <A HREF="/video/titles/blindmountain">Blind Mountain</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/bridesheadrevisited">Brideshead Revisited</A><BR>
<SPAN CLASS="green">63</SPAN> <A HREF="/video/titles/burnafterreading">Burn After Reading</A><BR>
<SPAN CLASS="green">61</SPAN> <A HREF="/video/titles/bustindownthedoor">Bustin' Down the Door</A><BR>
<SPAN CLASS="yellow">49</SPAN> <A HREF="/video/titles/childrenofhuangshi">Children of Huang Shi [53] => The</A><BR>
<SPAN CLASS="yellow">58</SPAN> <A HREF="/video/titles/cityofember">City of Ember</A><BR>
<SPAN CLASS="green">82</SPAN> <A HREF="/video/titles/darkknight"><B>Dark Knight [54] => The</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/deathrace">Death Race</A><BR>
<SPAN CLASS="red">15</SPAN> <A HREF="/video/titles/disastermovie">Disaster Movie</A><BR>
<SPAN CLASS="green">62</SPAN> <A HREF="/video/titles/duchess2008">Duchess [55] => The</A><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/eagleeye">Eagle Eye</A><BR>
<SPAN CLASS="noscore">xx</SPAN> <A HREF="/video/titles/edenlake">Eden Lake</A><BR>
<SPAN CLASS="yellow">58</SPAN> <A HREF="/video/titles/express">Express [56] => The</A><BR>
<SPAN CLASS="yellow">49</SPAN> <A HREF="/video/titles/familythatpreys">Family That Preys [57] => The</A><BR>
<SPAN CLASS="green">67</SPAN> <A HREF="/video/titles/flowforloveofwater">Flow: For Love of Water</A><BR>
<SPAN CLASS="green">72</SPAN> <A HREF="/video/titles/ghosttown">Ghost Town</A><BR>
<SPAN CLASS="yellow">54</SPAN> <A HREF="/video/titles/hamlet2">Hamlet 2</A><BR>
<SPAN CLASS="green">71</SPAN> <A HREF="/video/titles/hortonhears">Horton Hears a Who!</A><BR>
<SPAN CLASS="yellow">55</SPAN> <A HREF="/video/titles/housebunny">House Bunny [58] => The</A><BR>
<SPAN CLASS="yellow">40</SPAN> <A HREF="/video/titles/igor">Igor</A><BR>
<SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/mammamia">Mamma Mia!</A><BR>
<SPAN CLASS="green">89</SPAN> <A HREF="/video/titles/manonwire"><B>Man on Wire</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
<SPAN CLASS="red">31</SPAN> <A HREF="/video/titles/maxpayne">Max Payne</A><BR>
<SPAN CLASS="red">35</SPAN> <A HREF="/video/titles/mirrors">Mirrors</A><BR>
<SPAN CLASS="red">31</SPAN> <A HREF="/video/titles/mummy3">Mummy: Tomb of the Dragon Emperor [59] => The</A><BR>
<SPAN CLASS="red">34</SPAN> <A HREF="/video/titles/mybestfriendsgirl">My Best Friend's Girl</A><BR>
<SPAN CLASS="green">66</SPAN> <A HREF="/video/titles/pattismith">Patti Smith: Dream of Life</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/pineappleexpress">Pineapple Express</A><BR>
<SPAN CLASS="yellow">55</SPAN> <A HREF="/video/titles/pingpongplaya">Ping Pong Playa</A><BR>
<SPAN CLASS="red">32</SPAN> <A HREF="/video/titles/repo">Repo! The Genetic Opera</A><BR>
<SPAN CLASS="red">36</SPAN> <A HREF="/video/titles/righteouskill">Righteous Kill</A><BR>
<SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/savagegrace">Savage Grace</A><BR>
<SPAN CLASS="yellow">56</SPAN> <A HREF="/video/titles/saveme">Save Me</A><BR>
<SPAN CLASS="red">19</SPAN> <A HREF="/video/titles/sawv">Saw V</A><BR>
<SPAN CLASS="red">16</SPAN> <A HREF="/video/titles/surferdude">Surfer [60] => Dude</A><BR>
<SPAN CLASS="yellow">47</SPAN> <A HREF="/video/titles/swingvote">Swing Vote</A><BR>
<SPAN CLASS="yellow">57</SPAN> <A HREF="/video/titles/towelhead">Towelhead</A><BR>
<SPAN CLASS="yellow">60</SPAN> <A HREF="/video/titles/traitor">Traitor</A><BR>
<SPAN CLASS="green">61</SPAN> <A HREF="/video/titles/wackness">Wackness [61] => The</A><BR>
<SPAN CLASS="green">72</SPAN> <A HREF="/video/titles/womanonthebeach">Woman on the Beach</A><BR>
<SPAN CLASS="red">27</SPAN> <A HREF="/video/titles/women2008">Women [62] => The</A><BR>
</p>
</div>

Does anyone see how the preg_match_all calls should be written?
Below is the entire script if you want to see that:

<html>
<?php
// Set up the CURL object
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
$userAgent = 'Internet Explorer';
curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );

// Start the output buffering
ob_start();

// Get the HTML from MetaCritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

// Work through the raw movie data
$movies = array();
for( $i = 0; $i < count( $moviedata[1] ); $i++ ) {
  // The score is ok already
  $score = $moviedata[1][$i];

  // We need to remove tags from the title and decode
  // the HTML entities
  $title = $moviedata[2][$i];
  $title = preg_replace( "/<.*?>/", "", $title );
  $title = html_entity_decode( $title );

  // Then add the movie to the array
  $movies []= array( $score, $title );
}
?>
<body>
<table>
<tr> <th>Name</th><th>Score</th> </tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( $movie[1] ) ?></td>
<td><?php echo( $movie[0] ) ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

Thanks in advance, SK
effigy Posted January 22, 2009

Turn your warnings on:

Warning: preg_match_all() expects parameter 2 to be string, array given.

You need to examine the results; I recommend print_r($byname);
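To make that advice concrete, here is a minimal sketch of the debugging step. The one-entry $str is a made-up stand-in for the scraped page, and the two ini settings are just one common way to surface warnings; neither comes from the book.

```php
<?php
// Report every notice and warning while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical stand-in for the scraped page (assumption, not the real HTML).
$str = '<div id="sortbyname1"><SPAN CLASS="red">20</SPAN> '
     . '<A HREF="/video/titles/americancarol">American Carol</A><BR></div>';

preg_match_all('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname);

// $byname is an array of arrays:
//   $byname[0] lists every full match,
//   $byname[1] lists what the first capture group matched each time.
print_r($byname);

var_dump(is_array($byname[0]));     // bool(true)
var_dump(is_string($byname[0][0])); // bool(true)
```

The print_r output makes the nesting visible: the string that the second pattern needs sits one level deeper than $byname[0].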
jaxdevil Posted January 22, 2009 Author

Thanks! Using your suggestion I was able to locate the issue. On this line:

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

$byname[0] is itself an array, so $byname needs a second index, like this:

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0][0], $moviedata );

That fixed it! Thanks, SK
nrg_alpha Posted January 22, 2009

If you examine this line:

preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

preg_match_all by nature stores every full match of the pattern in $byname[0] (whereas $byname[1] stores what (.*?) captures). So $byname[0] is an array, and the matched string itself sits at $byname[0][0]. That is why passing $byname[0] straight into the second call failed.

I do wonder about one thing though. Your first preg_match_all looks for all instances of <div id="sortbyname1">....</div>, but for validation purposes ids are supposed to be unique (read: a given id, in this case sortbyname1, should appear only once per (X)HTML document). preg_match_all exists to match and capture a pattern more than once in a target string. If the page you are scraping is set up correctly, there should be only one div with the id sortbyname1, so you should really only need preg_match, not preg_match_all.

Multiple divs that share the same id should in fact use classes instead: <div class="sortbyname1">....</div> ..... <div class="sortbyname1">....</div>, and in that case preg_match_all would be warranted. This is not to say that multiple items sharing the same id will not work or render correctly; it's just that the page in question won't pass validation.
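A rough sketch of that suggestion, assuming the page really contains a single sortbyname1 div. The two-entry $str is invented sample data in the same shape as the scraped markup. Note that strip_tags, the [^>]* tag patterns, and PREG_SET_ORDER are substitutions for illustration, not what the book's script uses.

```php
<?php
// Hypothetical sample in the same shape as the scraped page (assumption).
$str = <<<HTML
<div id="sortbyname1"> <p class="listing">
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/anamorph">Anamorph</A><BR>
<SPAN CLASS="green">82</SPAN> <A HREF="/video/titles/darkknight"><B>Dark Knight, The</B></A><IMG SRC="/_images/scores/star.gif"><BR>
</p> </div>
HTML;

$movies = array();

// An id should be unique, so a single preg_match is enough here.
if (preg_match('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname)) {
    // With preg_match, $byname[1] is the captured inner HTML as a plain string.
    preg_match_all(
        '/<span[^>]*>(.*?)<\/span>.*?<a[^>]*>(.*?)<br>/is',
        $byname[1],
        $moviedata,
        PREG_SET_ORDER // one array per movie: [full match, score, raw title]
    );

    foreach ($moviedata as $m) {
        // Strip leftover tags from the title and decode HTML entities.
        $title = html_entity_decode(strip_tags($m[2]));
        $movies[] = array($m[1], $title);
    }
}

print_r($movies);
```

Because preg_match returns a flat result, the double index goes away: the inner HTML is simply $byname[1].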