
[SOLVED] preg_match_all not working


jaxdevil


I am working through an example scraper project from a book I bought (PHP Hacks), but it is not working. I typed the code exactly as it appears in the book, and it did NOT work as written. I am sure the problem is in these two lines, so I changed the expressions to try to fix them. I have echoed out $str and it does contain the page's contents, including the markup I want to scrape. Here is my modified preg_match_all code:

 

// Get just the list sorted by name
preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

// Get each of the movie entries
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/s', $byname[0], $moviedata );

 

and here is the original way the book had it:

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

 

and here is the code as it is in the $str variable (or at least the portion I am scraping)

 

      <div id="sortbyname1">

      <p class="listing">





  <SPAN CLASS="noscore">xx</SPAN>
  
    
    
      <A HREF="/video/titles/alphabetkiller">Alphabet Killer
    [51] =>  The</A><BR>
    
  

  <SPAN CLASS="red">20</SPAN>

  
    
    
      <A HREF="/video/titles/americancarol">American Carol
    [52] =>  An</A><BR>
    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/anamorph">Anamorph</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>
  
    
    
      <A HREF="/video/titles/appaloosa">Appaloosa</A><BR>
    
  

  <SPAN CLASS="red">26</SPAN>

  
    
    
      <A HREF="/video/titles/babylonad">Babylon A.D.</A><BR>
    
  

  <SPAN CLASS="red">24</SPAN>
  
    
    
      <A HREF="/video/titles/bangkokdangerous2008">Bangkok Dangerous</A><BR>
    
  

  <SPAN CLASS="yellow">60</SPAN>
  
    
    
      <A HREF="/video/titles/blindmountain">Blind Mountain</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>

  
    
    
      <A HREF="/video/titles/bridesheadrevisited">Brideshead Revisited</A><BR>
    
  

  <SPAN CLASS="green">63</SPAN>
  
    
    
      <A HREF="/video/titles/burnafterreading">Burn After Reading</A><BR>
    
  

  <SPAN CLASS="green">61</SPAN>
  
    
    
      <A HREF="/video/titles/bustindownthedoor">Bustin&#039; Down the Door</A><BR>

    
  

  <SPAN CLASS="yellow">49</SPAN>
  
    
    
      <A HREF="/video/titles/childrenofhuangshi">Children of Huang Shi
    [53] =>  The</A><BR>
    
  

  <SPAN CLASS="yellow">58</SPAN>
  
    
    
      <A HREF="/video/titles/cityofember">City of Ember</A><BR>
    
  

  <SPAN CLASS="green">82</SPAN>
  
    
      <A HREF="/video/titles/darkknight"><B>Dark Knight
    [54] =>  The</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>

    
    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/deathrace">Death Race</A><BR>
    
  

  <SPAN CLASS="red">15</SPAN>
  
    
    
      <A HREF="/video/titles/disastermovie">Disaster Movie</A><BR>
    
  

  <SPAN CLASS="green">62</SPAN>
  
    
    
      <A HREF="/video/titles/duchess2008">Duchess
    [55] =>  The</A><BR>

    
  

  <SPAN CLASS="yellow">43</SPAN>
  
    
    
      <A HREF="/video/titles/eagleeye">Eagle Eye</A><BR>
    
  

  <SPAN CLASS="noscore">xx</SPAN>
  
    
    
      <A HREF="/video/titles/edenlake">Eden Lake</A><BR>
    
  

  <SPAN CLASS="yellow">58</SPAN>
  
    
    
      <A HREF="/video/titles/express">Express
    [56] =>  The</A><BR>

    
  

  <SPAN CLASS="yellow">49</SPAN>
  
    
    
      <A HREF="/video/titles/familythatpreys">Family That Preys
    [57] =>  The</A><BR>
    
  

  <SPAN CLASS="green">67</SPAN>
  
    
    
      <A HREF="/video/titles/flowforloveofwater">Flow: For Love of Water</A><BR>
    
  

  <SPAN CLASS="green">72</SPAN>
  
    
    
      <A HREF="/video/titles/ghosttown">Ghost Town</A><BR>

    
  

  <SPAN CLASS="yellow">54</SPAN>
  
    
    
      <A HREF="/video/titles/hamlet2">Hamlet 2</A><BR>
    
  

  <SPAN CLASS="green">71</SPAN>
  
    
    
      <A HREF="/video/titles/hortonhears">Horton Hears a Who!</A><BR>
    
  

  <SPAN CLASS="yellow">55</SPAN>
  
    
    
      <A HREF="/video/titles/housebunny">House Bunny
    [58] =>  The</A><BR>

    
  

  <SPAN CLASS="yellow">40</SPAN>
  
    
    
      <A HREF="/video/titles/igor">Igor</A><BR>
    
  

  <SPAN CLASS="yellow">51</SPAN>
  
    
    
      <A HREF="/video/titles/mammamia">Mamma Mia!</A><BR>
    
  

  <SPAN CLASS="green">89</SPAN>
  
    
      <A HREF="/video/titles/manonwire"><B>Man on Wire</B></A><IMG SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>

    
    
  

  <SPAN CLASS="red">31</SPAN>
  
    
    
      <A HREF="/video/titles/maxpayne">Max Payne</A><BR>
    
  

  <SPAN CLASS="red">35</SPAN>
  
    
    
      <A HREF="/video/titles/mirrors">Mirrors</A><BR>
    
  

  <SPAN CLASS="red">31</SPAN>
  
    
    
      <A HREF="/video/titles/mummy3">Mummy: Tomb of the Dragon Emperor
    [59] =>  The</A><BR>

    
  

  <SPAN CLASS="red">34</SPAN>
  
    
    
      <A HREF="/video/titles/mybestfriendsgirl">My Best Friend&#039;s Girl</A><BR>
    
  

  <SPAN CLASS="green">66</SPAN>
  
    
    
      <A HREF="/video/titles/pattismith">Patti Smith: Dream of Life</A><BR>
    
  

  <SPAN CLASS="green">64</SPAN>
  
    
    
      <A HREF="/video/titles/pineappleexpress">Pineapple Express</A><BR>

    
  

  <SPAN CLASS="yellow">55</SPAN>
  
    
    
      <A HREF="/video/titles/pingpongplaya">Ping Pong Playa</A><BR>
    
  

  <SPAN CLASS="red">32</SPAN>
  
    
    
      <A HREF="/video/titles/repo">Repo! The Genetic Opera</A><BR>
    
  

  <SPAN CLASS="red">36</SPAN>
  
    
    
      <A HREF="/video/titles/righteouskill">Righteous Kill</A><BR>

    
  

  <SPAN CLASS="yellow">51</SPAN>
  
    
    
      <A HREF="/video/titles/savagegrace">Savage Grace</A><BR>
    
  

  <SPAN CLASS="yellow">56</SPAN>
  
    
    
      <A HREF="/video/titles/saveme">Save Me</A><BR>
    
  

  <SPAN CLASS="red">19</SPAN>
  
    
    
      <A HREF="/video/titles/sawv">Saw V</A><BR>

    
  

  <SPAN CLASS="red">16</SPAN>
  
    
    
      <A HREF="/video/titles/surferdude">Surfer
    [60] =>  Dude</A><BR>
    
  

  <SPAN CLASS="yellow">47</SPAN>
  
    
    
      <A HREF="/video/titles/swingvote">Swing Vote</A><BR>
    
  

  <SPAN CLASS="yellow">57</SPAN>
  
    
    
      <A HREF="/video/titles/towelhead">Towelhead</A><BR>

    
  

  <SPAN CLASS="yellow">60</SPAN>
  
    
    
      <A HREF="/video/titles/traitor">Traitor</A><BR>
    
  

  <SPAN CLASS="green">61</SPAN>
  
    
    
      <A HREF="/video/titles/wackness">Wackness
    [61] =>  The</A><BR>
    
  

  <SPAN CLASS="green">72</SPAN>
  
    
    
      <A HREF="/video/titles/womanonthebeach">Woman on the Beach</A><BR>

    
  

  <SPAN CLASS="red">27</SPAN>
  
    
    
      <A HREF="/video/titles/women2008">Women
    [62] =>  The</A><BR>
    
  


</p>

</div>

 

Anyone see how the preg_match_all should be? Below is the entire script if you want to see it:

 

<html>
<?php

// Set up the CURL object
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
$userAgent = 'Internet Explorer';
curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );

// Start the output buffering
ob_start();

// Get the HTML from MetaCritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

// Get just the list sorted by name
preg_match_all( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is", $str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

// Work through the raw movie data
$movies = array();
for( $i = 0; $i < count( $moviedata[1] ); $i++ )
{
// The score is ok already
$score = $moviedata[1][$i];

// We need to remove tags from the title and decode
// the HTML entities
$title = $moviedata[2][$i];
$title = preg_replace( "/<.*?>/", "", $title );
$title = html_entity_decode( $title );

// Then add the movie to the array
$movies []= array( $score, $title );
}
?>
<body>
<table>
<tr>
<th>Name</th><th>Score</th>
</tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( $movie[1] ) ?></td>
<td><?php echo( $movie[0] ) ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

 

Thanks in advance,

SK

https://forums.phpfreaks.com/topic/141954-solved-preg_match_all-not-working/

Thanks! Using your suggestion I was able to locate the issue. On this line:

 

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0], $moviedata );

 

$byname needs a second index, like this:

preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is", $byname[0][0], $moviedata );

 

That fixed it!

 

Thanks,

SK
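
For anyone landing here later, here is a minimal sketch of why the second index is needed. The $str value is a made-up, shortened stand-in for the scraped page (an assumption, not the real Metacritic markup); the two patterns are the ones from the book.

```php
<?php
// Hypothetical, shortened stand-in for the page fetched with cURL.
$str = '<div id="sortbyname1"><SPAN CLASS="red">20</SPAN> '
     . '<A HREF="/video/titles/x">Some Title</A><BR></div>';

// With the default PREG_PATTERN_ORDER flag, preg_match_all fills $byname so that
// $byname[0] is an ARRAY of full matches and $byname[1] an ARRAY of group 1 captures.
preg_match_all('/<DIV ID="sortbyname1">(.*?)<\/DIV>/is', $str, $byname);

var_dump(is_array($byname[0]));     // the list of full matches
var_dump(is_string($byname[0][0])); // the first full match, usable as a subject string

// So the second call has to receive one string out of that list.
preg_match_all('/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is', $byname[0][0], $moviedata);

echo $moviedata[1][0], "\n"; // score captured from the SPAN
echo $moviedata[2][0], "\n"; // title, still carrying the closing </A> tag
```

The trailing </A> in the title is what the preg_replace step in the full script strips afterwards.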


If you examine this line:

preg_match_all( '/<div id="sortbyname1">(.*?)<\/div>/s', $str, $byname );

$byname[0] by nature holds the full matches of the pattern, whereas $byname[1] holds what (.*?) captured. Because preg_match_all collects every match, each of those is itself an array, so passing plain $byname[0] to the second call hands it an array rather than a string. That is why $byname[0][0] fixed it.

I do wonder about one thing, though. Your first preg_match_all looks for every instance of <div id="sortbyname1">....</div>, but ids are supposed to be unique: a valid (X)HTML document should contain only one element with a given id (here, sortbyname1). preg_match_all exists to match / capture a pattern more than once in the target string. If the page you are scraping is set up correctly, there is only one div with the id sortbyname1, so you should really only need preg_match, not preg_match_all.
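
The preg_match suggestion can be sketched roughly like this. The sample markup and the $listHtml variable name are assumptions for illustration; the point is that preg_match returns flat strings, so no second index is needed.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$str = '<p>intro</p><div id="sortbyname1"><SPAN CLASS="green">64</SPAN> '
     . '<A HREF="/video/titles/appaloosa">Appaloosa</A><BR></div>';

// preg_match stops after the first match and returns 1 on success.
// Here $byname[0] is the full match and $byname[1] the captured text, both plain strings.
if (preg_match('/<div id="sortbyname1">(.*?)<\/div>/is', $str, $byname) === 1) {
    $listHtml = $byname[1];
} else {
    $listHtml = ''; // div not found, so nothing to scan
}

// The per-movie pattern still needs preg_match_all, since there are many entries.
preg_match_all('/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is', $listHtml, $moviedata);

echo $moviedata[1][0], "\n"; // score
echo $moviedata[2][0], "\n"; // title with the trailing </A> still attached
```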

 

Multiple divs that share the same id should in fact use classes instead: <div class="sortbyname1">....</div> ... <div class="sortbyname1">....</div>, and in that case preg_match_all would be warranted. This is not to say that multiple items sharing the same id will not work or render correctly; it's just that the page in question won't pass validation.
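
As a rough illustration of the class-based case, where preg_match_all is the right tool, here is a small sketch. The $html string is invented for the example and is not taken from the page being scraped.

```php
<?php
// Hypothetical markup with two divs sharing one class.
$html = '<div class="sortbyname1">first</div><p>x</p>'
      . '<div class="sortbyname1">second</div>';

// preg_match_all returns the number of full pattern matches.
$count = preg_match_all('/<div class="sortbyname1">(.*?)<\/div>/is', $html, $blocks);

// $blocks[1] holds one captured string per matching div, in document order.
foreach ($blocks[1] as $i => $inner) {
    echo $i, ': ', $inner, "\n";
}
```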
