jaxdevil Posted January 22, 2009

I am trying an example scraper project from a book I bought (PHP Hacks), but it is not working. I typed the code exactly as it appears in the book. I am sure the problem is in these two lines; I changed something in the expressions (it did NOT work as originally written). I have echoed out $str, and it is grabbing the page's contents; I can see the markup I want to scrape.

Here is the preg_match_all code:

// Get just the list sorted by name
preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );
// Get each of the movie entries
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/s', $byname[0], $moviedata );

and here is the original way the book had it:

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );
// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

and here is the markup as it is in the $str variable (or at least the portion I am scraping):

<div id="sortbyname1">
<p class="listing">
<SPAN CLASS="noscore">xx</SPAN> <A HREF="/video/titles/alphabetkiller">Alphabet Killer [51] => The</A><BR>
<SPAN CLASS="red">20</SPAN> <A HREF="/video/titles/americancarol">American Carol [52] => An</A><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/anamorph">Anamorph</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/appaloosa">Appaloosa</A><BR>
<SPAN CLASS="red">26</SPAN> <A HREF="/video/titles/babylonad">Babylon A.D.</A><BR>
<SPAN CLASS="red">24</SPAN> <A HREF="/video/titles/bangkokdangerous2008">Bangkok Dangerous</A><BR>
<SPAN CLASS="yellow">60</SPAN> <A HREF="/video/titles/blindmountain">Blind Mountain</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/bridesheadrevisited">Brideshead Revisited</A><BR>
<SPAN CLASS="green">63</SPAN> <A HREF="/video/titles/burnafterreading">Burn After Reading</A><BR>
<SPAN CLASS="green">61</SPAN> <A HREF="/video/titles/bustindownthedoor">Bustin' Down the Door</A><BR>
<SPAN CLASS="yellow">49</SPAN> <A HREF="/video/titles/childrenofhuangshi">Children of Huang Shi [53] => The</A><BR>
<SPAN CLASS="yellow">58</SPAN> <A HREF="/video/titles/cityofember">City of Ember</A><BR>
<SPAN CLASS="green">82</SPAN> <A HREF="/video/titles/darkknight"><B>Dark Knight [54] => The</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/deathrace">Death Race</A><BR>
<SPAN CLASS="red">15</SPAN> <A HREF="/video/titles/disastermovie">Disaster Movie</A><BR>
<SPAN CLASS="green">62</SPAN> <A HREF="/video/titles/duchess2008">Duchess [55] => The</A><BR>
<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/eagleeye">Eagle Eye</A><BR>
<SPAN CLASS="noscore">xx</SPAN> <A HREF="/video/titles/edenlake">Eden Lake</A><BR>
<SPAN CLASS="yellow">58</SPAN> <A HREF="/video/titles/express">Express [56] => The</A><BR>
<SPAN CLASS="yellow">49</SPAN> <A HREF="/video/titles/familythatpreys">Family That Preys [57] => The</A><BR>
<SPAN CLASS="green">67</SPAN> <A HREF="/video/titles/flowforloveofwater">Flow: For Love of Water</A><BR>
<SPAN CLASS="green">72</SPAN> <A HREF="/video/titles/ghosttown">Ghost Town</A><BR>
<SPAN CLASS="yellow">54</SPAN> <A HREF="/video/titles/hamlet2">Hamlet 2</A><BR>
<SPAN CLASS="green">71</SPAN> <A HREF="/video/titles/hortonhears">Horton Hears a Who!</A><BR>
<SPAN CLASS="yellow">55</SPAN> <A HREF="/video/titles/housebunny">House Bunny [58] => The</A><BR>
<SPAN CLASS="yellow">40</SPAN> <A HREF="/video/titles/igor">Igor</A><BR>
<SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/mammamia">Mamma Mia!</A><BR>
<SPAN CLASS="green">89</SPAN> <A HREF="/video/titles/manonwire"><B>Man on Wire</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
<SPAN CLASS="red">31</SPAN> <A HREF="/video/titles/maxpayne">Max Payne</A><BR>
<SPAN CLASS="red">35</SPAN> <A HREF="/video/titles/mirrors">Mirrors</A><BR>
<SPAN CLASS="red">31</SPAN> <A HREF="/video/titles/mummy3">Mummy: Tomb of the Dragon Emperor [59] => The</A><BR>
<SPAN CLASS="red">34</SPAN> <A HREF="/video/titles/mybestfriendsgirl">My Best Friend's Girl</A><BR>
<SPAN CLASS="green">66</SPAN> <A HREF="/video/titles/pattismith">Patti Smith: Dream of Life</A><BR>
<SPAN CLASS="green">64</SPAN> <A HREF="/video/titles/pineappleexpress">Pineapple Express</A><BR>
<SPAN CLASS="yellow">55</SPAN> <A HREF="/video/titles/pingpongplaya">Ping Pong Playa</A><BR>
<SPAN CLASS="red">32</SPAN> <A HREF="/video/titles/repo">Repo! The Genetic Opera</A><BR>
<SPAN CLASS="red">36</SPAN> <A HREF="/video/titles/righteouskill">Righteous Kill</A><BR>
<SPAN CLASS="yellow">51</SPAN> <A HREF="/video/titles/savagegrace">Savage Grace</A><BR>
<SPAN CLASS="yellow">56</SPAN> <A HREF="/video/titles/saveme">Save Me</A><BR>
<SPAN CLASS="red">19</SPAN> <A HREF="/video/titles/sawv">Saw V</A><BR>
<SPAN CLASS="red">16</SPAN> <A HREF="/video/titles/surferdude">Surfer [60] => Dude</A><BR>
<SPAN CLASS="yellow">47</SPAN> <A HREF="/video/titles/swingvote">Swing Vote</A><BR>
<SPAN CLASS="yellow">57</SPAN> <A HREF="/video/titles/towelhead">Towelhead</A><BR>
<SPAN CLASS="yellow">60</SPAN> <A HREF="/video/titles/traitor">Traitor</A><BR>
<SPAN CLASS="green">61</SPAN> <A HREF="/video/titles/wackness">Wackness [61] => The</A><BR>
<SPAN CLASS="green">72</SPAN> <A HREF="/video/titles/womanonthebeach">Woman on the Beach</A><BR>
<SPAN CLASS="red">27</SPAN> <A HREF="/video/titles/women2008">Women [62] => The</A><BR>
</p>
</div>

Can anyone see how the preg_match_all calls should be written?
Below is the entire script if you want to see that:

<html>
<?php
// Set up the CURL object
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
$userAgent = 'Internet Explorer';
curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );

// Start the output buffering
ob_start();

// Get the HTML from MetaCritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

// Work through the raw movie data
$movies = array();
for( $i = 0; $i < count( $moviedata[1] ); $i++ ) {
    // The score is ok already
    $score = $moviedata[1][$i];

    // We need to remove tags from the title and decode
    // the HTML entities
    $title = $moviedata[2][$i];
    $title = preg_replace( "/<.*?>/", "", $title );
    $title = html_entity_decode( $title );

    // Then add the movie to the array
    $movies []= array( $score, $title );
}
?>
<body>
<table>
<tr>
<th>Name</th><th>Score</th>
</tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( $movie[1] ) ?></td>
<td><?php echo( $movie[0] ) ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

Thanks in advance,
SK

Link to comment https://forums.phpfreaks.com/topic/141954-solved-preg_match_all-not-working/
effigy Posted January 22, 2009

Turn your warnings on:

Warning: preg_match_all() expects parameter 2 to be string, array given.

You need to examine the results; I recommend print_r($byname);
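For anyone following along, here is a minimal sketch of what that inspection shows. The shortened $str below is a hypothetical stand-in for the scraped page, not the real Metacritic markup, and the error-reporting lines are just one common way to surface warnings while debugging.

```php
<?php
// Show every warning and notice while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical, shortened stand-in for the scraped page.
$str = '<div id="sortbyname1"><SPAN CLASS="red">15</SPAN> '
     . '<A HREF="/video/titles/disastermovie">Disaster Movie</A><BR></div>';

preg_match_all('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname);

// $byname is an array of arrays: index 0 holds the full matches,
// index 1 holds what the first capture group matched.
print_r($byname);

// $byname[0] is what the script passed as the subject: an array.
var_dump(is_array($byname[0]));
// $byname[0][0] is the first full match: a string.
var_dump(is_string($byname[0][0]));
```

The print_r output makes the nesting visible, which is why the second preg_match_all call complains about receiving an array.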
jaxdevil Posted January 22, 2009 Author

Thanks! Using your suggestion I was able to locate the issue. On this line:

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

$byname needs to be indexed twice, like this:

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0][0], $moviedata );

That fixed it!

Thanks,
SK
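A small sketch of the corrected two-step extraction, run against a hypothetical two-entry sample shaped like the scraped markup (the sample string is an assumption for illustration; the patterns are the book's, minus the unneeded backslashes):

```php
<?php
// Hypothetical two-entry sample in the same shape as the scraped markup.
$str = '<div id="sortbyname1"> <p class="listing"> '
     . '<SPAN CLASS="yellow">43</SPAN> <A HREF="/video/titles/anamorph">Anamorph</A><BR> '
     . '<SPAN CLASS="green">82</SPAN> <A HREF="/video/titles/darkknight"><B>Dark Knight</B></A><BR> '
     . '</p> </div>';

// Get just the list sorted by name
preg_match_all('/<DIV ID="sortbyname1">(.*?)<\/DIV>/is', $str, $byname);

// Pass the first full match (a string), not the array of all matches.
preg_match_all('/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is', $byname[0][0], $moviedata);

// Build score/title pairs, stripping leftover tags from each title.
$movies = array();
for ($i = 0; $i < count($moviedata[1]); $i++) {
    $title = preg_replace('/<.*?>/', '', $moviedata[2][$i]);
    $movies[] = array($moviedata[1][$i], html_entity_decode($title));
}

print_r($movies);
```

With this sample, $movies ends up holding one score/title pair per SPAN/A entry.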
nrg_alpha Posted January 22, 2009

If you examine this line:

preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

$byname[0] holds the entire match of the pattern (whereas $byname[1] holds what (.*?) captured). Because preg_match_all can match more than once, each of those is itself an array, so passing $byname[0] as the subject of the second call hands it an array rather than a string; $byname[0][0] is the first full match.

I do wonder about one thing, though. In your first preg_match_all you are looking for every instance of <div id="sortbyname1">....</div>, but for validation purposes ids are supposed to be unique (that is, a given id such as sortbyname1 should appear only once per (X)HTML document). preg_match_all exists to match and capture a pattern more than once in a target string. If the page you are scraping is set up correctly, there should be only one div with the id sortbyname1, so you should really only need preg_match, not preg_match_all.

Multiple divs that share an identifier should use classes instead: <div class="sortbyname1">....</div> ..... <div class="sortbyname1">....</div>, and in that case preg_match_all would be warranted. This is not to say that multiple elements sharing the same id will not work or render correctly; it's just that the page in question won't pass validation.
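To illustrate the preg_match suggestion, here is a rough sketch, assuming the page really contains a single div with that id. The one-entry $str is a hypothetical sample, and using the captured group $m[1] instead of the full match is a choice made for this example, not something the thread settled on.

```php
<?php
// Hypothetical one-entry sample; the real page has many more entries.
$str = '<div id="sortbyname1"> <SPAN CLASS="red">15</SPAN> '
     . '<A HREF="/video/titles/disastermovie">Disaster Movie</A><BR> </div>';

$moviedata = array();

// preg_match() stops at the first match. $m[0] is the full match and
// $m[1] is the capture group, and both are plain strings.
if (preg_match('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $m)) {
    // The entries repeat, so preg_match_all() is still the right tool here.
    preg_match_all('/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is', $m[1], $moviedata);
}

print_r($moviedata);
```

Because preg_match fills $m with strings directly, there is no second index to remember when passing the result on.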
This topic is now archived and is closed to further replies.