Parsing HTML to strip required data


TGWSE_GY

So I have an interesting one for you guys this AM,

I first want to make it very clear that I am not scraping code, rather I am scraping data that is needed to import into a shopping cart system for someone.

I have a URL that I am trying to scrape the required data from; however, it is not returning all the data that I want. I have created a function that uses preg_match_all() with a regex, and I am still having trouble stripping out what I want.

 

Here is a link to the test page I want to strip the data from: http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html

 

I am wanting to grab all this data:

 

  Quote
Item Number :  K-686

Style :  Shiraz

Province :  Fars

Made In :  Iran

Foundation :  Wool

Pile :  100% Wool

Colors :  Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) :  4' 11" x  3' 4"

Size (Centimeter) :  155 x 103

Age :  20-25 Years Old

Condition :  Very Good

KPSI (knots per sq. inch) :  130 knots per square inch

Woven :  Hand Knotted

Shipping and Handling :  Free Shipping(For Mainland USA)

Est. Retail Value :  $2,700.00

 

Here is the code; note that $url holds the link above.

$html = file_get_contents($url);

// Whitespace and control characters to strip before matching.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$html = str_replace($newlines, "", html_entity_decode($html));
preg_match_all('/<tr><td width="50%" align="right"><font color="#800000"><b>[^\s ](.*?)<\/b><\/font><\/td><td width="50%" align="left">[^\s ](.*?)<\/td><\/tr>/', $html, $matches, PREG_SET_ORDER);

foreach($matches as $match){
	$count = 0;
	echo $match[$count];
	echo "<br>";
	$count++;
}
echo $count;

 

This returns the following:

  Quote
Style :  Shiraz

Province :  Fars

Foundation :  Wool

Colors :  Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) :  4' 11" x  3' 4"

Size (Centimeter) :  155 x 103

Age :  20-25 Years Old

Condition :  Very Good

Est. Retail Value :  $2,700.00

1

 

It is missing:

  Quote

Inventory Number : xxxxxxx

Made In : xxxxxxxx

Pile : xxxxxxxxxx

KPSI (knots per sq. inch) : xxxxxxxxxx

Woven : xxxxxxxxx

Shipping and Handling : xxxxxxxxxxx

 

You can see the script in action here -> http://visualrealityink.com/dev/clients/rug_src/scrapeing/scrape_tst.php

 

Thanks in advance for all of your help :)


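Try:

<?php
$url = 'http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html';
$file = file_get_contents($url);
// Capture the label after <b> up to the colon, then everything up to the end of the row.
// The /s modifier lets "." match newlines, so rows that span several lines still match.
$pattern = '/<b>([^<:]+)\s*:  (.*?)<\/td><\/tr>/s';
preg_match_all($pattern, $file, $matchesarray);
$out = array();
foreach ($matchesarray[1] as $i => $key) {
    // Strip leftover tags from the value and trim whitespace on both sides.
    $out[trim($key)] = trim(strip_tags($matchesarray[2][$i]));
}
// Pass true so print_r() returns the dump instead of printing it and leaving a stray "1".
echo '<pre>', print_r($out, true), '</pre>';
?>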
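
For anyone who wants to see the difference without fetching the page, here is a minimal offline sketch. The sample rows are made up (the real markup is not shown in this thread), the function name extractFields() is only for illustration, and both patterns are simplified variants rather than the exact ones posted above. It assumes the missing rows differ from the matched ones only by whitespace or line breaks between the tags, which may or may not be what is happening on the real page.

```php
<?php
// Hypothetical sample rows; the real page's markup is not reproduced in the thread.
$html = <<<'HTML'
<tr><td align="right"><font color="#800000"><b>Style :  </b></font></td><td align="left">Shiraz</td></tr>
<tr>
  <td align="right"><font color="#800000"><b>Made In :  </b></font></td>
  <td align="left">Iran</td>
</tr>
<tr><td align="right"><font color="#800000"><b>Pile :  </b></font></td><td align="left">
100% Wool</td></tr>
HTML;

// Strict pattern in the spirit of the question: tags must sit directly next to
// each other, and "." cannot cross a line break because there is no /s modifier.
$strict = '/<tr><td[^>]*><font[^>]*><b>(.*?)<\/b><\/font><\/td><td[^>]*>(.*?)<\/td><\/tr>/';

// Looser pattern modelled on the reply: anchor on "<b>label :", let "." span
// newlines (/s), and allow whitespace between the closing </td> and </tr>.
$loose = '/<b>([^<:]+?)\s*:\s*(.*?)<\/td>\s*<\/tr>/s';

// Hypothetical helper: returns label => value pairs for every row the pattern matches.
function extractFields(string $pattern, string $html): array
{
    $fields = [];
    preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);
    foreach ($rows as $row) {
        // Drop any tags and a trailing colon from the label, and tags from the value.
        $label = trim(rtrim(trim(strip_tags($row[1])), ':'));
        $fields[$label] = trim(strip_tags($row[2]));
    }
    return $fields;
}

$strictFields = extractFields($strict, $html);
$looseFields  = extractFields($loose, $html);

echo count($strictFields), " vs ", count($looseFields), " rows matched\n";
print_r($looseFields);
```

On this sample the strict pattern only picks up the single-line row, while the looser one picks up all three. If the real rows differ in some other way (different attributes, nested tags), the pattern would need adjusting to match.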

