
Content Scraper help needed


aforaryal


I would really appreciate it if you guys could help me out with this problem.

 

I am trying to build a content scraper using file_get_contents and regular expressions. Everything works fine except that the page I am scraping generates its tables in two different ways.

 

The first kind is a table where everything (e.g. name, location, link) is present. I successfully scraped this table’s contents.

 

Here’s the code of this type of table:

 

<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
<tbody>
	<tr>
		<td width="24%" height="20" valign="top">Y</td>
		<td width="76%" valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top">X</td>
	</tr>
	<tr>

		<td height="20" valign="top">Y</td>
		<td valign="top">Z</td>
	</tr>
<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top"><a href="mailto: X">X</a></td>
	</tr>
	<tr>
		<td height="20" valign="top">Y</td>
		<td valign="top"><a href="Z" target="_blank">Z</a></td>
	</tr>	</tbody>
</table>
</div>
</div>

 

However, the page also generates a few tables where everything (name, location etc.) is present except the row that holds the link. The scraper I built breaks when this happens: instead of stopping at the end of that table, it scrapes the link from an altogether different table and then carries on through the rest of the page from there.
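To illustrate why this probably happens, here is a minimal sketch using made-up, heavily simplified markup (the block contents and example.com URL are placeholders, not the real page). A lazy `.*?` is not confined to one block; it keeps growing until the rest of the pattern can match, so a missing link in one block gets "borrowed" from the next.

```php
<?php
// Two simplified blocks: the first has no link, the second does.
// This markup is a hypothetical stand-in for the real page.
$html = '<div class="A"><td>name1</td></div>'
      . '<div class="A"><td>name2</td><a href="http://example.com/x">site2</a></div>';

// The pattern requires a link, like the last row of the full scraper regex.
preg_match_all(
    '/<div class="A"><td>(.*?)<\/td>.*?<a href=".*?">(.*?)<\/a>/s',
    $html,
    $m,
    PREG_SET_ORDER
);

// One match only: the first block's name is paired with the second block's link,
// and the second block's own name is swallowed by the .*? in between.
print_r($m);
```

In this sketch the name from the first block ends up next to the link text from the second block, which looks like the behavior described above.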

 

Here’s the code for this “other” type of table.

 

<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
<tbody>
	<tr>
		<td width="24%" height="20" valign="top">Y</td>
		<td width="76%" valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">Z</td>
		<td valign="top">X</td>
	</tr>
	<tr>

		<td height="20" valign="top">Y</td>
		<td valign="top">Z</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top">Y</td>
	</tr>
	<tr>

		<td height="20" valign="top">X</td>
		<td valign="top"><a href="mailto:Y">Y</a></td>
	</tr>
</tbody>
</table>
</div>
</div>

 

And here’s my scraper.

$url='url.html';
$raw=file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$html = str_replace($newlines, "", html_entity_decode($raw));

// Class names below match the sample markup above: "C" holds the title, "D" wraps the table.
preg_match_all('/<div class="A" id=".*?" style="display:none">.*?<div class="C">(.*?)<\/div>.*?<div class="D"><table border=".*?0.*?" cellspacing=".*?1.*?" cellpadding=".*?0.*?" width=".*?100%.*?">.*?<tbody>.*?<tr>.*?<td width=".*?24%.*?" height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td width=".*?76%.*?" valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?">(.*?)<\/a><\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?http:\/\/.*?\/(.*?)" target=".*?_blank.*?">(.*?)<\/a><\/td>.*?<\/tr>.*?<\/tbody>.*?<\/table>.*?<\/div>.*?<\/div>/s',
    $html,
    $posts,
    PREG_SET_ORDER
);

Obviously, what I want to do is scrape the contents of both kinds of table. If the hyperlink row at the very end is not present, I would like the scraper to move on to the next matching block instead of searching further for a hyperlink.
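For illustration, one way this behavior might be achieved is a two-pass match: first isolate each block, then pull the cells out of that block only, treating the website link as optional. This is only a sketch under assumptions: the sample markup, names and example.com URLs are invented, and it assumes every `class="A"` block contains exactly one table.

```php
<?php
// Hypothetical sample: block b1 has no website row, block b2 has one.
$html = <<<'HTML'
<div class="A" id="b1" style="display:none"><div class="C">First</div><div class="D"><table><tbody>
<tr><td width="24%">Name</td><td width="76%">Alice</td></tr>
<tr><td>Email</td><td><a href="mailto:a@example.com">a@example.com</a></td></tr>
</tbody></table></div></div>
<div class="A" id="b2" style="display:none"><div class="C">Second</div><div class="D"><table><tbody>
<tr><td width="24%">Name</td><td width="76%">Bob</td></tr>
<tr><td>Email</td><td><a href="mailto:b@example.com">b@example.com</a></td></tr>
<tr><td>Web</td><td><a href="http://example.com/bob" target="_blank">example.com/bob</a></td></tr>
</tbody></table></div></div>
HTML;

// Pass 1: cut the page into blocks, each ending at its first </table>,
// so nothing matched later can spill into a neighbouring block.
preg_match_all('/<div class="A"[^>]*>.*?<\/table>/s', $html, $blocks);

$posts = [];
foreach ($blocks[0] as $block) {
    // Pass 2: grab the second <td> of every row, however many rows there are.
    preg_match_all(
        '/<tr>\s*<td[^>]*>.*?<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/s',
        $block,
        $cells
    );

    // The website link is optional and is searched for inside this block only.
    $hasLink = preg_match('/<a href="(https?:\/\/[^"]*)"/', $block, $link);

    $posts[] = [
        'values' => array_map('strip_tags', $cells[1]),
        'link'   => $hasLink ? $link[1] : null,
    ];
}

print_r($posts);
```

With this split, a block without a website row simply gets `null` for its link, and the next block is processed independently. The `mailto:` link is not mistaken for the website because the link pattern only accepts `http://` or `https://` URLs.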

 

Does anyone know a way around this problem? I am new to regular expressions, and somewhat new to PHP itself.

 

Thanks!

 
