Content Scraper help needed

aforaryal · July 12, 2009

I would really appreciate it if you guys could help me out with this problem.

Basically, I was trying to build a content scraper using file_get_contents and regular expressions. Everything worked fine, except that the page I am trying to scrape generates its tables in two different ways.

The first type is where everything (e.g. name, location, link, etc.) is present. I successfully scraped this table’s contents.

Here’s the markup for this type of table:

<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
<tbody>
	<tr>
		<td width="24%" height="20" valign="top">Y</td>
		<td width="76%" valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top">X</td>
	</tr>
	<tr>

		<td height="20" valign="top">Y</td>
		<td valign="top">Z</td>
	</tr>
<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top"><a href="mailto: X">X</a></td>
	</tr>
	<tr>
		<td height="20" valign="top">Y</td>
		<td valign="top"><a href="Z" target="_blank">Z</a></td>
	</tr>	</tbody>
</table>
</div>
</div>

However, the page also generates a few tables where everything (name, location, etc.) is present except the row containing the link. When this happens, my scraper breaks: it ends up scraping the link from an altogether different table, and then continues through the rest of the code.

Here’s the code for this “other” type of table.

<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
<tbody>
	<tr>
		<td width="24%" height="20" valign="top">Y</td>
		<td width="76%" valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top">X</td>
	</tr>
	<tr>

		<td height="20" valign="top">Y</td>
		<td valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top"><a href="mailto:Y">Y</a></td>
	</tr>
</tbody>
</table>
</div>
</div>

And here’s my scraper.

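// Fetch the raw page.
$url = 'url.html';
$raw = file_get_contents($url);

// Decode entities, then strip tabs, line breaks, double spaces, NUL and vertical-tab characters.
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$html = str_replace($newlines, "", html_entity_decode($raw));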

preg_match_all('/<div class="A" id=".*?" style="display:none">.*?<div class="C">(.*?)<\/div>.*?<div class="D"><table border=".*?0.*?" cellspacing=".*?1.*?" cellpadding=".*?0.*?" width=".*?100%.*?">.*?<tbody>.*?<tr>.*?<td width=".*?24%.*?" height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td width=".*?76%.*?" valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?">(.*?)<\/a><\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?http:\/\/.*?\/(.*?)" target=".*?_blank.*?">(.*?)<\/a><\/td>.*?<\/tr>.*?<\/tbody>.*?<\/table>.*?<\/div>.*?<\/div>/s',
    $html,
    $posts,
    PREG_SET_ORDER
);
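
The collapse described earlier can be reproduced with a cut-down version of this pattern. This is only a sketch: the two-row blocks, the variable names `$withoutLink`, `$withLink` and `$pattern`, and the example.com URL are made up for illustration and are not taken from the real page. The point it tries to show is that `.*?` combined with the `s` modifier will cross block boundaries when a required piece (here, the link) is missing from the current block.

```php
<?php
// Hypothetical, cut-down blocks: one name row plus an optional link row.
$withoutLink = '<div class="A"><table><tr><td>Name</td><td>Alice</td></tr></table></div>';
$withLink    = '<div class="A"><table><tr><td>Name</td><td>Bob</td></tr>'
             . '<tr><td>Site</td><td><a href="http://example.com/bob">bob</a></td></tr></table></div>';

// Alice's block (no link row) comes first, Bob's block second.
$html = $withoutLink . $withLink;

// Same style as the full pattern: lazy wildcards between fixed pieces, link required.
$pattern = '/<div class="A">.*?<td>Name<\/td><td>(.*?)<\/td>.*?<a href="(.*?)">.*?<\/a>.*?<\/table>/s';

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

foreach ($posts as $post) {
    // $post[1] is the name cell, $post[2] is whatever href the pattern reached.
    echo $post[1], ' => ', $post[2], "\n";
}
```

Run as is, this pairs Alice with Bob's link and never reports Bob at all, because the single match swallows both blocks.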

Obviously, what I want to do is scrape the contents of both types of table. If the hyperlink in the very last cell is not present, I would like the scraper to jump to the next block of matching code instead of searching for a matching hyperlink.
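
One possible way to get that behaviour, sketched under assumptions and not tested against the real page, is to work in two stages: first cut the page into one string per `<div class="A">` block, then run a small row pattern inside each block. A missing link row then just yields fewer values for that block instead of pulling in the next block's link. The function name `scrapeBlocks`, the rule of stopping each block at its first `</table>`, and the sample data are all hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns one array of value cells per block.
 * Assumes every block starts with <div class="A" ...> and holds exactly one table.
 */
function scrapeBlocks(string $html): array
{
    $results = array();

    // Stage 1: isolate each block; the lazy match stops at the block's own </table>.
    preg_match_all('/<div class="A"[^>]*>.*?<\/table>/s', $html, $blocks);

    foreach ($blocks[0] as $block) {
        // Stage 2: capture the second cell of every row inside this block only.
        preg_match_all(
            '/<tr>\s*<td[^>]*>.*?<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/s',
            $block,
            $rows
        );

        // Drop <a ...> wrappers so link cells come back as plain text.
        $results[] = array_map(function ($cell) {
            return trim(strip_tags($cell));
        }, $rows[1]);
    }

    return $results;
}

// Made-up sample: the first block has no link row, the second block has one.
$html = '<div class="A" id="1"><div class="D"><table><tbody>'
      . '<tr><td>Name</td><td>Alice</td></tr>'
      . '</tbody></table></div></div>'
      . '<div class="A" id="2"><div class="D"><table><tbody>'
      . '<tr><td>Name</td><td>Bob</td></tr>'
      . '<tr><td>Site</td><td><a href="http://example.com/bob" target="_blank">bob</a></td></tr>'
      . '</tbody></table></div></div>';

print_r(scrapeBlocks($html));
```

With this sample the first block comes back with one value and the second with two, so each block keeps only its own rows. The sketch throws away the href itself; if that is needed, the captured cell could be matched again for `href="..."` before `strip_tags` is applied.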

I was wondering if you guys know a way around this problem. I am new to regular expressions, and somewhat new to PHP itself.

Thanks!
