TGWSE_GY Posted April 3, 2011

So I have an interesting one for you guys this morning. I first want to make it very clear that I am not scraping code; rather, I am scraping data that needs to be imported into a shopping cart system for someone. I have a URL that I am trying to scrape the required data from, but it is not returning all the data that I want. I have created a function that uses preg_match_all() and a regex, and I am still having issues stripping out what I want.

Here is a link to the test page I want to strip the data from:
http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html

I want to grab all of this data:

Item Number: K-686
Style : Shiraz
Province : Fars
Made In : Iran
Foundation : Wool
Pile : 100% Wool
Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange
Size (feet) : 4' 11" x 3' 4"
Size (Centimeter) : 155 x 103
Age : 20-25 Years Old
Condition : Very Good
KPSI (knots per sq. inch) : 130 knots per square inch
Woven : Hand Knotted
Shipping and Handling : Free Shipping(For Mainland USA)
Est. Retail Value : $2,700.00

Here is the code. Note that $url holds the link above.

$html = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$html = str_replace($newlines, "", html_entity_decode($html));
preg_match_all('/<tr><td width="50%" align="right"><font color="#800000"><b>[^\s ](.*?)<\/b><\/font><\/td><td width="50%" align="left">[^\s ](.*?)<\/td><\/tr>/', $html, $matches, PREG_SET_ORDER);
foreach($matches as $match){
    $count = 0;
    echo $match[$count];
    echo "<br>";
    $count++;
}
echo $count;

This returns the following:

Style : Shiraz
Province : Fars
Foundation : Wool
Colors : Red, Navy Blue, Ivory, Forest Green, Light Blue, Orange
Size (feet) : 4' 11" x 3' 4"
Size (Centimeter) : 155 x 103
Age : 20-25 Years Old
Condition : Very Good
Est. Retail Value : $2,700.00
1

It is missing:

Inventory Number : xxxxxxx
Made In : xxxxxxxx
Pile : xxxxxxxxxx
KPSI (Knots Per Inch) : xxxxxxxxxx
Woven : xxxxxxxxx
Shipping : xxxxxxxxxxx

You can see the script in action here -> http://visualrealityink.com/dev/clients/rug_src/scrapeing/scrape_tst.php

Thanks in advance for all of your help.

sasa Posted April 3, 2011

try

<?php
$url = 'http://visualrealityink.com/dev/clients/rug_src/scrapeing/Rugsource/www.vendio.com/stores/Rugsource1/item/other/tribal-wool-3x5-shiraz-persian/lid=10363581.html';
$file = file_get_contents($url);
// Capture the label up to the colon, then everything up to the end of the row
$pattern = '/<b>([^<:]+)\s*: (.*?)<\/td><\/tr>/s';
preg_match_all($pattern, $file, $matchesarray);
$out = array();
foreach ($matchesarray[1] as $i => $key) {
    // Strip leftover tags and surrounding whitespace from each value
    $out[trim($key)] = trim(strip_tags($matchesarray[2][$i]));
}
// Pass true so print_r() returns the string instead of printing it and echoing a stray "1"
echo '<pre>', print_r($out, true), '</pre>';
?>
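
To see why the looser pattern in the reply above picks up rows that a strict, attribute-by-attribute pattern can miss, here is a minimal sketch that runs it against a small sample. The sample markup and the function name parse_spec_rows() are assumptions made up for illustration; the real page's rows may be laid out differently. The pattern only anchors on `<b>`, the colon followed by a space, and the closing `</td></tr>`, so extra tags or line breaks between the two cells do not matter.

```php
<?php
// Hypothetical helper name; the pattern is the one from the reply above.
function parse_spec_rows($html)
{
    // Group 1: label text up to the colon. Group 2: everything up to the row end.
    // The /s modifier lets "." match newlines, so a row may span several lines.
    $pattern = '/<b>([^<:]+)\s*: (.*?)<\/td><\/tr>/s';
    $out = array();
    if (preg_match_all($pattern, $html, $m)) {
        foreach ($m[1] as $i => $label) {
            // strip_tags() removes the leftover </b></font></td><td ...> markup
            $out[trim($label)] = trim(strip_tags($m[2][$i]));
        }
    }
    return $out;
}

// Assumed sample markup (not copied from the real page). The second row
// has a line break between cells and a nested <b> around the value.
$sample = <<<'HTML'
<table>
<tr><td width="50%" align="right"><font color="#800000"><b>Style : </b></font></td><td width="50%" align="left">Shiraz</td></tr>
<tr><td width="50%" align="right"><font color="#800000"><b>Made In : </b></font></td>
<td width="50%" align="left"><b>Iran</b></td></tr>
<tr><td width="50%" align="right"><font color="#800000"><b>Est. Retail Value : </b></font></td><td width="50%" align="left">$2,700.00</td></tr>
</table>
HTML;

print_r(parse_spec_rows($sample));
```

On this sample all three labels come back as array keys with their cleaned values. Note that the pattern expects a space after the colon, so a label written as `<b>Made In :</b>` with no trailing space would still be skipped.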

TGWSE_GY Posted April 3, 2011

sasa, thank you so much. As usual, your advice and code are flawless!