Parsing HTML to strip required data

TGWSE_GY · April 3, 2011

So I have an interesting one for you guys this morning.

I first want to make it very clear that I am not scraping code; I am scraping data that needs to be imported into a shopping cart system for someone.

I have a URL that I am trying to scrape the required data from, but it is not returning all the data that I want. I have created a function that uses preg_match_all() with a regex, and I am still having issues stripping out what I want.

Here is a link to the test page I want to strip the data from: http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html

I am wanting to grab all this data:

Item Number:
K-686

Style : Shiraz

Province : Fars

Made In : Iran

Foundation : Wool

Pile : 100% Wool

Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) : 4' 11" x 3' 4"

Size (Centimeter) : 155 x 103

Age : 20-25 Years Old

Condition : Very Good

KPSI (knots per sq. inch) : 130 knots per square inch

Woven : Hand Knotted

Shipping and Handling : Free Shipping(For Mainland USA)

Est. Retail Value : $2,700.00

Here is the code. Note that $url holds the link above.

$html = file_get_contents($url);

	$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

	$html = str_replace($newlines, "", html_entity_decode($html));
	preg_match_all('/<tr><td width="50%" align="right"><font color="#800000"><b>[^\s ](.*?)<\/b><\/font><\/td><td width="50%" align="left">[^\s ](.*?)<\/td><\/tr>/', $html, $matches, PREG_SET_ORDER);

	foreach($matches as $match){
			$count = 0;
			echo $match[$count];
			echo "<br>";
			$count++;

	}
	echo $count;

This returns the following:

Style : Shiraz
Province : Fars

Foundation : Wool

Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) : 4' 11" x 3' 4"

Size (Centimeter) : 155 x 103

Age : 20-25 Years Old

Condition : Very Good

Est. Retail Value : $2,700.00

1

It is missing:

Item Number : xxxxxxx

Made In : xxxxxxxx

Pile : xxxxxxxxxx

KPSI (knots per sq. inch) : xxxxxxxxxx

Woven : xxxxxxxxx

Shipping and Handling : xxxxxxxxxxx

You can see the script in action here -> http://visualrealityink.com/dev/clients/rug_src/scrapeing/scrape_tst.php

Thanks in advance for all of your help.

sasa · April 3, 2011

try

<?php
$url = 'http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html';
$file = file_get_contents($url);
$pattern = '/<b>([^<:]+)\s*:  (.*?)<\/td><\/tr>/s';
preg_match_all($pattern, $file, $matchesarray);
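$out = array();
foreach($matchesarray[1] as $i => $key) $out[trim($key)] = trim(strip_tags ($matchesarray[2][$i]));
// pass true so print_r() returns the string instead of printing it and echoing a stray 1
echo '<pre>', print_r($out, true), '</pre>';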
?>
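
For anyone who wants to try the idea without fetching the live page, here is a minimal offline sketch. The two sample rows are hypothetical (the real page's markup may differ), extract_pairs() is a made-up helper name, and the pattern is a variant of the one above with the whitespace after the colon relaxed to \s*, which is an assumption on my part rather than something taken from the page. It matches from each <b> label up to the end of the row, so it does not depend on the exact attributes of the cells in between.

```php
<?php
// Hypothetical sample rows; the second one uses different cell
// attributes and wraps the value in a link.
$sample = '<table>'
    . '<tr><td width="50%" align="right"><font color="#800000"><b>Style : </b></font></td>'
    . '<td width="50%" align="left">Shiraz</td></tr>'
    . '<tr><td align="right"><b>Made In :</b></td>'
    . '<td align="left"><a href="#">Iran</a></td></tr>'
    . '</table>';

// Hypothetical helper: returns an array of label => value pairs.
function extract_pairs($html)
{
    // Group 1: the label text before the colon.
    // Group 2: everything up to the end of the row (tags are stripped below).
    $pattern = '/<b>([^<:]+)\s*:\s*(.*?)<\/td>\s*<\/tr>/s';
    $out = array();
    if (preg_match_all($pattern, $html, $m)) {
        foreach ($m[1] as $i => $label) {
            $out[trim($label)] = trim(strip_tags($m[2][$i]));
        }
    }
    return $out;
}

$pairs = extract_pairs($sample);
echo '<pre>', print_r($pairs, true), '</pre>';
```

Because the values are cleaned with strip_tags(), rows whose cells carry different attributes or extra inline tags should still come out as plain label and value pairs, as long as each label sits in a <b> tag and ends with a colon.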

TGWSE_GY · April 3, 2011

sasa, thank you so much! As usual, your advice and code are flawless!
