TGWSE_GY Posted April 3, 2011 Share Posted April 3, 2011 So I have an interesting one for you guys this AM, I first want to make it very clear that I am not scraping code, rather I am scraping data that is needed to import into a shopping cart system for someone. I have a URL that I am trying to scrape required data off of, however it is not returning all the data that I want. I have created a function that uses preg_match_all() and regex and I am still having issues striping what I want. here is a link to my test what I am wanting to strip from http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html I am wanting to grab all this data: Quote Item Number: K-686 Style : Shiraz Province : Fars Made In : Iran Foundation : Wool Pile : 100% Wool Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange Size (feet) : 4' 11" x 3' 4" Size (Centimeter) : 155 x 103 Age : 20-25 Years Old Condition : Very Good KPSI (knots per sq. inch) : 130 knots per square inch Woven : Hand Knotted Shipping and Handling : Free Shipping(For Mainland USA) Est. Retail Value : $2,700.00 Here is the code note that $url holds the link above. $html = file_get_contents($url); $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B"); $html = str_replace($newlinews, "", html_entity_decode($html)); preg_match_all('/<tr><td width="50%" align="right"><font color="#800000"><b>[^\s ](.*?)<\/b><\/font><\/td><td width="50%" align="left">[^\s ](.*?)<\/td><\/tr>/', $html, $matches, PREG_SET_ORDER); foreach($matches_label as $match){ $count = 0; echo $match[$count]; echo "<br>"; $count++; } echo $count; This returns the following Quote Style : Shiraz Province : Fars Foundation : Wool Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange Size (feet) : 4' 11" x 3' 4" Size (Centimeter) : 155 x 103 Age : 20-25 Years Old Condition : Very Good Est. Retail Value : $2,700.00 1 it is missing: Quote Inventory Number : xxxxxxx Made In: xxxxxxxx Pile : xxxxxxxxxx KPSI(Knots Per Inch) : xxxxxxxxxx Woven : xxxxxxxxx Shopping : xxxxxxxxxxx You can see the script in action here -> http://visualrealityink.com/dev/clients/rug_src/scrapeing/scrape_tst.php Thanks in advance for all of your help Link to comment https://forums.phpfreaks.com/topic/232544-parsing-html-to-strip-required-data/ Share on other sites More sharing options...
sasa Posted April 3, 2011 Share Posted April 3, 2011 try <?php $url = 'http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html'; $file = file_get_contents($url); $pattern = '/<b>([^<:]+)\s*: (.*?)<\/td><\/tr>/s'; preg_match_all($pattern, $file, $matchesarray); foreach($matchesarray[1] as $i => $key) $out[trim($key)] = trim(strip_tags ($matchesarray[2][$i])); echo '<pre>', print_r($out), '</pre>'; ?> Link to comment https://forums.phpfreaks.com/topic/232544-parsing-html-to-strip-required-data/#findComment-1196163 Share on other sites More sharing options...
TGWSE_GY Posted April 3, 2011 Author Share Posted April 3, 2011 sasa thank you so much as usuall your advice and code are flawless! Link to comment https://forums.phpfreaks.com/topic/232544-parsing-html-to-strip-required-data/#findComment-1196287 Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.