
Parsing HTML to strip required data


TGWSE_GY


So I have an interesting one for you guys this morning.

I first want to make it very clear that I am not scraping code; I am scraping data that needs to be imported into a shopping cart system for someone.

I have a URL that I am trying to scrape the required data from, but it is not returning all the data I want. I have written a function that uses preg_match_all() with a regex, and I am still having trouble extracting what I want.

 

Here is a link to the test page I want to extract the data from: http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html

 

I am wanting to grab all this data:

 

Item Number :  K-686

Style :  Shiraz

Province :  Fars

Made In :  Iran

Foundation :  Wool

Pile :  100% Wool

Colors :  Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) :  4' 11" x  3' 4"

Size (Centimeter) :  155 x 103

Age :  20-25 Years Old

Condition :  Very Good

KPSI (knots per sq. inch) :  130 knots per square inch

Woven :  Hand Knotted

Shipping and Handling :  Free Shipping(For Mainland USA)

Est. Retail Value :  $2,700.00

 

Here is the code. Note that $url holds the link above.

$html = file_get_contents($url);

// Characters to strip before matching: tab, newline, carriage return, double space, NUL, vertical tab
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$html = str_replace($newlines, "", html_entity_decode($html));
preg_match_all('/<tr><td width="50%" align="right"><font color="#800000"><b>[^\s ](.*?)<\/b><\/font><\/td><td width="50%" align="left">[^\s ](.*?)<\/td><\/tr>/', $html, $matches, PREG_SET_ORDER);

foreach($matches as $match){
	$count = 0;            // reset on every pass, so the final echo prints 1
	echo $match[$count];   // $match[0] is the full matched row
	echo "<br>";
	$count++;
}
echo $count;

 

This returns the following:

Style :  Shiraz

Province :  Fars

Foundation :  Wool

Colors :  Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange

Size (feet) :  4' 11" x  3' 4"

Size (Centimeter) :  155 x 103

Age :  20-25 Years Old

Condition :  Very Good

Est. Retail Value :  $2,700.00

1

 

It is missing:

Item Number : xxxxxxx

Made In : xxxxxxxx

Pile : xxxxxxxxxx

KPSI (knots per sq. inch) : xxxxxxxxxx

Woven : xxxxxxxxx

Shipping and Handling : xxxxxxxxxxx
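
For reference, here is a minimal sketch of how PREG_SET_ORDER lays out its results. The sample rows and the simplified pattern are made up for illustration (the real page markup is not reproduced in this post), and $rows is just a hypothetical name. Each entry of $matches is one matched row: index 0 holds the whole match, index 1 the first capture group and index 2 the second, so counting rows is a matter of count($matches) rather than a counter inside the loop.

```php
<?php
// Hypothetical sample rows; the real page markup is not shown in this thread.
$html = '<tr><td><b>Style : </b></td><td>Shiraz</td></tr>'
      . '<tr><td><b>Made In : </b></td><td>Iran</td></tr>';

// Simplified pattern written for the sample above, not the pattern from the post.
preg_match_all('/<b>(.*?)<\/b><\/td><td>(.*?)<\/td>/', $html, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $match) {
    // $match[0] is the whole matched text, $match[1] the label, $match[2] the value.
    $label = trim($match[1], " :");   // drop surrounding spaces and the trailing colon
    $rows[$label] = trim($match[2]);
}

print_r($rows);
echo count($matches), "\n";           // number of matched rows
```

Building a label => value array this way also makes it easy to compare the keys against the list of expected fields and see which rows the pattern skipped.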

 

You can see the script in action here -> http://visualrealityink.com/dev/clients/rug_src/scrapeing/scrape_tst.php

 

Thanks in advance for all of your help :)


try

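<?php
$url = 'http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html';
$file = file_get_contents($url);
// Capture the label inside <b> up to the colon, then everything up to the end of the row
$pattern = '/<b>([^<:]+)\s*:  (.*?)<\/td><\/tr>/s';
preg_match_all($pattern, $file, $matchesarray);
$out = array();
foreach ($matchesarray[1] as $i => $key) {
    // Use the trimmed label as the key and strip leftover tags from the value
    $out[trim($key)] = trim(strip_tags($matchesarray[2][$i]));
}
// Pass true so print_r() returns the string instead of printing it and returning true
echo '<pre>', print_r($out, true), '</pre>';
?>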
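
If the regex turns out to be too sensitive to small markup differences between rows, another option is to parse the page with DOMDocument and DOMXPath instead. This is a different technique from the regex above, shown only as a rough sketch: the sample markup is hypothetical (modelled on the attributes in the question's pattern), clean_cell() is a made-up helper, and it assumes each label/value pair sits in a row with exactly two cells.

```php
<?php
// Hypothetical sample markup; attribute order and nested tags vary between rows on purpose.
$html = '<table>'
      . '<tr><td width="50%" align="right"><font color="#800000"><b>Style : </b></font></td>'
      . '<td width="50%" align="left">&nbsp;Shiraz</td></tr>'
      . '<tr><td align="right" width="50%"><font color="#800000"><b>Made In :</b></font></td>'
      . '<td width="50%" align="left"><font size="2">Iran</font></td></tr>'
      . '</table>';

// Made-up helper: strips surrounding whitespace, non-breaking spaces and colons.
function clean_cell(string $text): string
{
    return preg_replace('/^[\s\x{A0}:]+|[\s\x{A0}:]+$/u', '', $text);
}

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$out = [];
// Assumption: every label/value pair is a <tr> with exactly two direct <td> children.
foreach ($xpath->query('//tr[count(td) = 2]') as $row) {
    $cells = $xpath->query('td', $row);               // direct child cells of this row
    $label = clean_cell($cells->item(0)->textContent);
    $value = clean_cell($cells->item(1)->textContent);
    if ($label !== '') {
        $out[$label] = $value;
    }
}

print_r($out);
```

Because textContent ignores tags and attributes, rows that wrap the value in an extra <font> tag or list the attributes in a different order are picked up the same way. To run it against the real page, $html would come from file_get_contents($url) as in the code above.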
