
[SOLVED] preg_match_all not working


jaxdevil


I am trying an example scraper project from a book I bought (PHP Hacks), but it is not working. I typed the code exactly as it appears in the book, and it did NOT work as written. I am fairly sure the problem is in the two preg_match_all lines, so I changed the expressions, which did not help either. I have echoed out $str, and it does contain the page contents, including the markup I want to scrape. Here is my modified preg_match_all code:

 

// Get just the list sorted by name
preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

// Get each of the movie entries
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/s', $byname[0], $moviedata );

 

and here is the original way the book had it:

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

 

and here is the code as it is in the $str variable (or at least the portion I am scraping)

 

      <div id="sortbyname1">

      <p class="listing">





  <SPAN CLASS="noscore">xx</SPAN>
  
    
    
      <A HREF="/video/titles/alphabetkiller">Alphabet Killer, The</A><BR>
    
  

  <SPAN CLASS="red">20</SPAN>

  
    
    
      <A HREF="/video/titles/americancarol">American Carol, An</A><BR>
    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/anamorph">Anamorph</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>
  
    
    
      <A HREF="/video/titles/appaloosa">Appaloosa</A><BR>
    
  

  <SPAN CLASS="red">26</SPAN>

  
    
    
      <A HREF="/video/titles/babylonad">Babylon A.D.</A><BR>
    
  

  <SPAN CLASS="red">24</SPAN>
  
    
    
      <A HREF="/video/titles/bangkokdangerous2008">Bangkok Dangerous</A><BR>
    
  

  <SPAN CLASS="yellow">60</SPAN>
  
    
    
      <A HREF="/video/titles/blindmountain">Blind Mountain</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>

  
    
    
      <A HREF="/video/titles/bridesheadrevisited">Brideshead Revisited</A><BR>
    
  

  <SPAN CLASS="green">63</SPAN>
  
    
    
      <A HREF="/video/titles/burnafterreading">Burn After Reading</A><BR>
    
  

  <SPAN CLASS="green">61</SPAN>
  
    
    
      <A HREF="/video/titles/bustindownthedoor">Bustin&#039; Down the Door</A><BR>

    
  

  <SPAN CLASS="yellow">49</SPAN>
  
    
    
      <A HREF="/video/titles/childrenofhuangshi">Children of Huang Shi, The</A><BR>
    
  

  <SPAN CLASS="yellow">58</SPAN>
  
    
    
      <A HREF="/video/titles/cityofember">City of Ember</A><BR>
    
  

  <SPAN CLASS="green">82</SPAN>
  
    
      <A HREF="/video/titles/darkknight"><B>Dark Knight, The</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>

    
    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/deathrace">Death Race</A><BR>
    
  

  <SPAN CLASS="red">15</SPAN>
  
    
    
      <A HREF="/video/titles/disastermovie">Disaster Movie</A><BR>
    
  

  <SPAN CLASS="green">62</SPAN>
  
    
    
      <A HREF="/video/titles/duchess2008">Duchess, The</A><BR>

    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/eagleeye">Eagle Eye</A><BR>
    
  

  <SPAN CLASS="noscore">xx</SPAN>
  
    
    
      <A HREF="/video/titles/edenlake">Eden Lake</A><BR>
    
  

  <SPAN CLASS="yellow">58</SPAN>
  
    
    
      <A HREF="/video/titles/express">Express, The</A><BR>

    
  

  <SPAN CLASS="yellow">49</SPAN>
  
    
    
      <A HREF="/video/titles/familythatpreys">Family That Preys, The</A><BR>
    
  

  <SPAN CLASS="green">67</SPAN>
  
    
    
      <A HREF="/video/titles/flowforloveofwater">Flow: For Love of Water</A><BR>
    
  

  <SPAN CLASS="green">72</SPAN>
  
    
    
      <A HREF="/video/titles/ghosttown">Ghost Town</A><BR>

    
  

  <SPAN CLASS="yellow">54</SPAN>
  
    
    
      <A HREF="/video/titles/hamlet2">Hamlet 2</A><BR>
    
  

  <SPAN CLASS="green">71</SPAN>
  
    
    
      <A HREF="/video/titles/hortonhears">Horton Hears a Who!</A><BR>
    
  

  <SPAN CLASS="yellow">55</SPAN>
  
    
    
      <A HREF="/video/titles/housebunny">House Bunny, The</A><BR>

    
  

  <SPAN CLASS="yellow">40</SPAN>
  
    
    
      <A HREF="/video/titles/igor">Igor</A><BR>
    
  

  <SPAN CLASS="yellow">51</SPAN>
  
    
    
      <A HREF="/video/titles/mammamia">Mamma Mia!</A><BR>
    
  

  <SPAN CLASS="green">89</SPAN>
  
    
      <A HREF="/video/titles/manonwire"><B>Man on Wire</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>

    
    
  

  <SPAN CLASS="red">31</SPAN>
  
    
    
      <A HREF="/video/titles/maxpayne">Max Payne</A><BR>
    
  

  <SPAN CLASS="red">35</SPAN>
  
    
    
      <A HREF="/video/titles/mirrors">Mirrors</A><BR>
    
  

  <SPAN CLASS="red">31</SPAN>
  
    
    
      <A HREF="/video/titles/mummy3">Mummy: Tomb of the Dragon Emperor, The</A><BR>

    
  

  <SPAN CLASS="red">34</SPAN>
  
    
    
      <A HREF="/video/titles/mybestfriendsgirl">My Best Friend&#039;s Girl</A><BR>
    
  

  <SPAN CLASS="green">66</SPAN>
  
    
    
      <A HREF="/video/titles/pattismith">Patti Smith: Dream of Life</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>
  
    
    
      <A HREF="/video/titles/pineappleexpress">Pineapple Express</A><BR>

    
  

  <SPAN CLASS="yellow">55</SPAN>
  
    
    
      <A HREF="/video/titles/pingpongplaya">Ping Pong Playa</A><BR>
    
  

  <SPAN CLASS="red">32</SPAN>
  
    
    
      <A HREF="/video/titles/repo">Repo! The Genetic Opera</A><BR>
    
  

  <SPAN CLASS="red">36</SPAN>
  
    
    
      <A HREF="/video/titles/righteouskill">Righteous Kill</A><BR>

    
  

  <SPAN CLASS="yellow">51</SPAN>
  
    
    
      <A HREF="/video/titles/savagegrace">Savage Grace</A><BR>
    
  

  <SPAN CLASS="yellow">56</SPAN>
  
    
    
      <A HREF="/video/titles/saveme">Save Me</A><BR>
    
  

  <SPAN CLASS="red">19</SPAN>
  
    
    
      <A HREF="/video/titles/sawv">Saw V</A><BR>

    
  

  <SPAN CLASS="red">16</SPAN>
  
    
    
      <A HREF="/video/titles/surferdude">Surfer, Dude</A><BR>
    
  

  <SPAN CLASS="yellow">47</SPAN>
  
    
    
      <A HREF="/video/titles/swingvote">Swing Vote</A><BR>
    
  

  <SPAN CLASS="yellow">57</SPAN>
  
    
    
      <A HREF="/video/titles/towelhead">Towelhead</A><BR>

    
  

  <SPAN CLASS="yellow">60</SPAN>
  
    
    
      <A HREF="/video/titles/traitor">Traitor</A><BR>
    
  

  <SPAN CLASS="green">61</SPAN>
  
    
    
      <A HREF="/video/titles/wackness">Wackness, The</A><BR>
    
  

  <SPAN CLASS="green">72</SPAN>
  
    
    
      <A HREF="/video/titles/womanonthebeach">Woman on the Beach</A><BR>

    
  

  <SPAN CLASS="red">27</SPAN>
  
    
    
      <A HREF="/video/titles/women2008">Women, The</A><BR>
    
  


</p>

</div>

 

Can anyone see how the preg_match_all calls should be written? Below is the entire script if you want to see it:

 

<html>
<?php

// Set up the CURL object
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
$userAgent = 'Internet Explorer';
curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );

// Start the output buffering
ob_start();

// Get the HTML from MetaCritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

// Work through the raw movie data
$movies = array();
for( $i = 0; $i < count( $moviedata[1] ); $i++ )
{
// The score is ok already
$score = $moviedata[1][$i];

// We need to remove tags from the title and decode
// the HTML entities
$title = $moviedata[2][$i];
$title = preg_replace( "/<.*?>/", "", $title );
$title = html_entity_decode( $title );

// Then add the movie to the array
$movies []= array( $score, $title );
}
?>
<body>
<table>
<tr>
<th>Name</th><th>Score</th>
</tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( $movie[1] ) ?></td>
<td><?php echo( $movie[0] ) ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

 

Thanks in advance,

SK


Thanks! Using your suggestion I was able to locate the issue. On this line:

 

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

 

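preg_match_all fills $byname as a two-dimensional array, so $byname[0] is itself an array and has to be indexed a second time to get the matched string: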

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0][0], $moviedata );

 

That fixed it!
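
For anyone reading later, here is a minimal sketch of why the second index is needed. The $str value is a made-up, shortened stand-in for the fetched page, not the real Metacritic markup, and it assumes the default PREG_PATTERN_ORDER flag:

```php
<?php
// Hypothetical sample standing in for the fetched page (assumption, not the real markup).
$str = '<div id="sortbyname1"><SPAN CLASS="red">20</SPAN> '
     . '<A HREF="/video/titles/americancarol">American Carol, An</A><BR></div>';

preg_match_all('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname);

// With the default PREG_PATTERN_ORDER flag:
//   $byname[0] is an array holding every full match
//   $byname[1] is an array holding every first-group capture
var_dump(is_array($byname[0]));     // bool(true)
var_dump(is_string($byname[0][0])); // bool(true)

// The subject of the second call must be a string, hence $byname[0][0].
$count = preg_match_all(
    '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is',
    $byname[0][0],
    $moviedata
);

echo $count, "\n";           // 1
echo $moviedata[1][0], "\n"; // 20
echo $moviedata[2][0], "\n"; // American Carol, An</A>
```

The trailing </A> in the title capture is what the preg_replace step in the full script strips out afterwards.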

 

Thanks,

SK


If you examine this line:

preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

you can see that preg_match_all stores every full match of the pattern in $byname[0] and everything captured by (.*?) in $byname[1], so each of those is itself an array. That is why passing $byname[0] straight into the second preg_match_all fails while $byname[0][0] works. I do wonder about one thing, though. Your first preg_match_all looks for every instance of "<div id="sortbyname1">....</div>", but ids are supposed to be unique: a valid (x)html document should contain only one element with the id sortbyname1. preg_match_all exists to match / capture a pattern more than once in the target string. If the page you are scraping is set up correctly, there is only one div with that id, so you really only need preg_match, not preg_match_all. With preg_match, $byname[0] is the matched string itself and $byname[1] is the captured contents, so no second index is needed.
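
A rough sketch of the preg_match approach, under the assumption that the page really has a single sortbyname1 div. The $str sample is invented for illustration, and PREG_SET_ORDER, strip_tags and ENT_QUOTES are choices made here for brevity rather than anything taken from the book's script:

```php
<?php
// Hypothetical sample markup (assumption, not the real page).
$str = '<p>intro</p><div id="sortbyname1">'
     . '<SPAN CLASS="green">64</SPAN><A HREF="/video/titles/appaloosa">Appaloosa</A><BR>'
     . '</div>';

// preg_match stops at the first match and returns 1, 0, or false.
if (preg_match('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname) === 1) {
    // $byname[0] is the full match (a string); $byname[1] is the captured inner HTML.
    $inner = $byname[1];
} else {
    $inner = '';
}

// PREG_SET_ORDER groups the results per match: $m[1] is the score, $m[2] the raw title.
preg_match_all(
    '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is',
    $inner,
    $moviedata,
    PREG_SET_ORDER
);

$movies = array();
foreach ($moviedata as $m) {
    // Remove leftover tags such as </A>, then decode entities such as &#039;.
    $title = html_entity_decode(strip_tags($m[2]), ENT_QUOTES);
    $movies[] = array($m[1], $title);
}

print_r($movies);
```

Checking the return value of preg_match also covers the case where the div is missing from the page, which the original script would pass over silently.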

 

Multiple divs that share the same id should in fact use a class instead: "<div class="sortbyname1">....</div>"....."<div class="sortbyname1">....</div>". In that case preg_match_all would be warranted. This is not to say that a page with duplicate ids will fail to work or render; it just won't pass validation.
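
As a small illustration of the class case, here is a sketch with two divs sharing a class. The markup is hypothetical and only meant to show what preg_match_all returns when the pattern matches more than once:

```php
<?php
// Hypothetical markup: two blocks sharing a class rather than an id (assumption).
$str = '<div class="sortbyname1"><SPAN CLASS="red">15</SPAN></div>'
     . '<div class="sortbyname1"><SPAN CLASS="green">82</SPAN></div>';

// preg_match_all returns the number of full matches found.
$found = preg_match_all('/<div class="sortbyname1">(.*?)<\/div>/is', $str, $blocks);

// $blocks[1] holds the inner HTML of every matching div, in document order.
foreach ($blocks[1] as $i => $inner) {
    echo $i, ': ', strip_tags($inner), "\n";
}
// Prints:
// 0: 15
// 1: 82
```

Note that the lazy (.*?) stops at the first closing </div>, so this only holds up when the blocks contain no nested divs.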
