aforaryal
Posted July 12, 2009

I would really appreciate it if you guys could help me out with this problem.

Basically, I was trying to build a content scraper using file_get_contents and regular expressions. Everything worked out fine except that the page I am trying to scrape generates tables in two different ways.

The first is where everything (e.g. name, location, link, etc.) is present. I successfully scraped this table’s contents. Here’s the code of this type of table:

    <div class="A" id="B" style="display:none">
    <div class="C"> X </div>
    <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
    <tbody>
    <tr> <td width="24%" height="20" valign="top">Y</td> <td width="76%" valign="top">Z</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">Z</td> <td valign="top">X</td> </tr>
    <tr> <td height="20" valign="top">Y</td> <td valign="top">Z</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">Z</td> <td valign="top"><a href="mailto: X">X</a></td> </tr>
    <tr> <td height="20" valign="top">Y</td> <td valign="top"><a href="Z" target="_blank">Z</a></td> </tr>
    </tbody>
    </table>
    </div>

However, the page also generates a few tables where everything (name, location, etc.) is present except the cell that contains the link. When this happens, my scraper breaks: it ends up scraping an altogether different table’s links and then continues through the rest of the code. Here’s the code for this “other” type of table:
    <div class="A" id="B" style="display:none">
    <div class="C"> X </div>
    <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
    <tbody>
    <tr> <td width="24%" height="20" valign="top">Y</td> <td width="76%" valign="top">Z</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">Z</td> <td valign="top">X</td> </tr>
    <tr> <td height="20" valign="top">Y</td> <td valign="top">Z</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top">Y</td> </tr>
    <tr> <td height="20" valign="top">X</td> <td valign="top"><a href="mailto:Y">Y</a></td> </tr>
    </tbody>
    </table>
    </div>

And here’s my scraper:

    $url = 'url.html';
    $raw = file_get_contents($url);

    // Strip tabs, line breaks, double spaces, NUL bytes and vertical tabs.
    $newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
    $html = str_replace($newlines, "", html_entity_decode($raw));

    // The pattern is split with "." only for readability; it is still one regex.
    preg_match_all(
        '/<div class="A" id=".*?" style="display:none">.*?'
        . '<div class="B">(.*?)<\/div>.*?'
        . '<div class="C"><table border=".*?0.*?" cellspacing=".*?1.*?" cellpadding=".*?0.*?" width=".*?100%.*?">.*?'
        . '<tbody>.*?'
        // Row 1: the cells that carry width attributes.
        . '<tr>.*?<td width=".*?24%.*?" height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td width=".*?76%.*?" valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        // Rows 2 to 6: plain label/value rows.
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?'
        // Row 7: the mailto link.
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?">(.*?)<\/a><\/td>.*?<\/tr>.*?'
        // Row 8: the web link (this is the row that is sometimes missing).
        . '<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?http:\/\/.*?\/(.*?)" target=".*?_blank.*?">(.*?)<\/a><\/td>.*?<\/tr>.*?'
        . '<\/tbody>.*?<\/table>.*?<\/div>.*?<\/div>/s',
        $html,
        $posts,
        PREG_SET_ORDER
    );

Obviously, what I want to do is scrape the contents of both kinds of table. If the hyperlink in the very last cell is not present, I would like the scraper to jump to the next block of matching code rather than search for a matching hyperlink.

I was wondering if you guys had a way past this problem. I am new to regular expressions, and somewhat new to PHP itself. Thanks!

Link to comment https://forums.phpfreaks.com/topic/165662-content-scraper-help-needed/
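One possible way to get the behaviour described above, offered only as a sketch: instead of one pattern that must match every row, first isolate each block (from its opening div to the end of its table), then pull the rows out of that block alone. Because the second step only ever sees one block, a missing link row cannot pull in links from a later table. The function name scrapeBlocks, the result keys (title, cells, link), and the sample markup with its example.com addresses are all hypothetical. The sketch also assumes each block contains exactly one table and that the web link is the anchor carrying target="_blank", as in the samples above.

```php
<?php
// Hypothetical sample input: two blocks, the second one has no web-link row.
$html = <<<HTML
<div class="A" id="b1" style="display:none"><div class="C">First</div><div class="D"><table><tbody>
<tr><td>Name</td><td>Alice</td></tr>
<tr><td>Email</td><td><a href="mailto:alice@example.com">alice@example.com</a></td></tr>
<tr><td>Site</td><td><a href="http://example.com/alice" target="_blank">example.com/alice</a></td></tr>
</tbody></table></div></div>
<div class="A" id="b2" style="display:none"><div class="C">Second</div><div class="D"><table><tbody>
<tr><td>Name</td><td>Bob</td></tr>
<tr><td>Email</td><td><a href="mailto:bob@example.com">bob@example.com</a></td></tr>
</tbody></table></div></div>
HTML;

// Hypothetical helper: returns one entry per block, with 'link' set to null
// when the block has no web-link row.
function scrapeBlocks($html)
{
    $results = array();

    // Step 1: isolate each block. The lazy (.*?) stops at the first </table>,
    // which (by assumption) belongs to the same block.
    preg_match_all(
        '/<div class="A" id="[^"]*" style="display:none">\s*<div class="C">(.*?)<\/div>(.*?)<\/table>/s',
        $html,
        $blocks,
        PREG_SET_ORDER
    );

    foreach ($blocks as $block) {
        // Step 2: capture the second cell of every row inside this block only.
        preg_match_all(
            '/<tr>\s*<td[^>]*>.*?<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/s',
            $block[2],
            $rows
        );

        // Reduce each captured cell to its visible text.
        $cells = array();
        foreach ($rows[1] as $cell) {
            $cells[] = trim(strip_tags($cell));
        }

        // Step 3: the web link is optional, so look for it separately.
        $link = null;
        if (preg_match('/<a href="(https?:\/\/[^"]*)" target="_blank">/', $block[2], $m)) {
            $link = $m[1];
        }

        $results[] = array(
            'title' => trim($block[1]),
            'cells' => $cells,
            'link'  => $link,
        );
    }

    return $results;
}

$posts = scrapeBlocks($html);
print_r($posts);
```

Splitting the match in two keeps each pattern short, and a block with fewer rows simply yields fewer cells. For markup that varies more than this, PHP's DOM extension (DOMDocument with DOMXPath) is usually a sturdier choice than regular expressions.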