php_n00b Posted February 15, 2011

I am parsing (screen-scraping, if you will) a website and need to extract the following:

    <div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle">
    <td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td>
    <td width="7%" class="maintext">Price:</td>
    <td width="23%" class="productname"><div id="message0">£19.15</div></td>
    <td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
    </tr></tbody></table></div>

There are multiple instances of the DIV like this (same class ID).

What I'd like is CSV output like this:

    PRODUCT NAME, -(SKU), PRICE

What I've tried and does not work:

    <?php
    /**
     * @get text between tags
     * @param string $tag The tag name
     * @param string $html The XML or XHTML string
     * @param int $strict Whether to use strict mode
     * @return array
     */
    function getTextBetweenTags($tag, $html, $strict=0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;

        /*** load the html into the object ***/
        if($strict==1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }

        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;

        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);

        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
        /*** return the results ***/
        return $out;
    }

    function getTags( $dom, $tagName, $attrName, $attrValue ){
        $html = '';
        $domxpath = new DOMXPath($dom);
        $newDom = new DOMDocument;
        $newDom->formatOutput = true;

        //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
        // $filtered = $domxpath->query('//div[@class="className"]');
        // '//' when you don't know 'absolute' path
        $filtered = $domxpath->query("*/div[@id='resprodtop']");

        // since above returns DomNodeList Object
        // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
        $i = 0;
        while( $myItem = $filtered->item($i++) ){
            $node = $newDom->importNode( $myItem, true ); // import node
            $newDom->appendChild($node); // append node
        }
        $html = $newDom->saveHTML();
        return $html;
    }

    $some_link = 'http://www...';
    $tagname = 'div';
    $attrName = 'class';
    $attrValue = 'resprodtop';

    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTMLFile($some_link);

    //If using domxpath
    //getTags( $dom, $tagName, $attrName, $attrValue );
    //echo $html;

    //If using gettextbetweentags
    $string = file_get_contents('http://www...');
    $content = getTextBetweenTags('div class="resprodtop"', $string, 1);
    foreach( $content as $item )
    {
        echo $item.'<br />';
    }
    ///
    ?>

With the domxpath I get no output at all, I believe. But with the above code commented I get a list of errors like:

    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
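For reference, the warnings above come from feeding HTML to the strict XML parser (`loadXML`), and the XPath query tests `@id` while the markup uses `class`. A minimal sketch of a DOM-based extraction follows, assuming every product block looks like the sample div in the post. The function name `extractProducts` and the choice of child selectors are illustrative assumptions, not something given in the thread.

```php
<?php
// Hypothetical helper (not from the original post): returns one
// [name, sku, price] triple per <div class="resprodtop"> found in $html.
function extractProducts(string $html): array
{
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);              // lenient HTML parser, not loadXML()
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows  = [];

    // '//' searches the whole document; the divs are matched on class, not id.
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = $xpath->query('.//a[@class="productname"]', $div)->item(0);
        $sku   = $xpath->query('.//span[@class="mainblack"]', $div)->item(0);
        $price = $xpath->query('.//td[@class="productname"]', $div)->item(0);
        if ($name === null || $sku === null || $price === null) {
            continue;                   // layout differs from the sample, skip it
        }
        // The SKU cell reads " - (SKU)"; keep only the text inside the brackets.
        $skuText = preg_match('/\(([^)]+)\)/', $sku->textContent, $m)
            ? $m[1]
            : trim($sku->textContent);
        $rows[] = [trim($name->textContent), $skuText, trim($price->textContent)];
    }
    return $rows;
}
```

Each returned triple can be passed straight to `fputcsv()`. If the live markup differs from the sample (for example extra classes on the div), the class tests would need loosening.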
silkfire Posted February 17, 2011

Use cURL instead with regex, it's unbeatable! loadXML is a very crappy solution. Dunno how to output it to CSV, but as a pure string it works perfectly:

    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25');
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // Fetch the page as a string
    $products = curl_exec($ch);

    // Capture groups: 1 = product name, 2 = SKU, 3 = price
    preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

    // Strip any leftover tags from every captured string
    foreach($matches as &$matchgroup) {
        foreach($matchgroup as &$match)
            $match = strip_tags($match);
    }
    unset($matchgroup, $match); // drop the dangling references

    foreach($matches as $product)
        echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], '<br>';
    ?>
php_n00b (Author) Posted February 21, 2011

Is loadXML generally a rubbish method or just the way I was using it?

That works fine, except where there is a comma (,) in the actual DIV. How would you extract the comma? Otherwise the CSV file generally gets confused. I haven't worked out how to do this yet, but it does save a CSV file properly now.

Now I have a text file of URLs I want. So I could just set that up as a loop, setting CURLOPT_URL to the variable grabbed from the file each time? If I do this, I have a feeling it would prompt me to save a file on each iteration, so can I just spit the strtoupper output into an array and then echo it at the end to get the CSV file?
silkfire Posted February 21, 2011

It won't prompt to save the file. Just use the APPEND flag when writing and it will add it to the end of the CSV file. Yeah, loadXML, DOM and XPath classes are rather useless. Regex is precise and quick and very easy to maintain. Where do you have comma in your div? Paste example link and I'll help you out!
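The "APPEND flag" mentioned here is presumably `FILE_APPEND` for `file_put_contents()`, or equivalently mode `'a'` for `fopen()`. A minimal sketch of the second form, assuming the rows have already been extracted; `appendRows` is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper: append rows to a CSV file without truncating it,
// so it can be called once per fetched URL.
function appendRows(string $csvPath, array $rows): void
{
    $fh = fopen($csvPath, 'a');          // 'a' = append, creates the file if missing
    if ($fh === false) {
        throw new RuntimeException("Cannot open $csvPath for appending");
    }
    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly.
        fputcsv($fh, $row, ',', '"', '\\');
    }
    fclose($fh);
}
```

Calling `appendRows('products.csv', $rows)` inside the URL loop adds to the same file on every pass. Nothing is sent to the browser, so there is no save prompt.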
php_n00b (Author) Posted February 22, 2011

Hi Silkfire.

    ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46
    ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87
                        ^ comma was here

"Advantage Small Cats, Small Dogs & Pet Rabbits"
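A comma inside a field is normally handled by quoting the field rather than removing the comma, and `fputcsv()` does that by itself. The sketch below shows this with the product name from the post; `csvLine` is a hypothetical helper that only exists to make the output visible as a string.

```php
<?php
// Hypothetical helper: format one row the way fputcsv() would write it.
function csvLine(array $fields): string
{
    $fh = fopen('php://memory', 'w+');   // in-memory stream, no file on disk
    fputcsv($fh, $fields, ',', '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);
    return rtrim($line, "\n");
}

$line = csvLine(['ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', 'BAADV08', '13.87']);
// $line is: "ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS",BAADV08,13.87
```

A spreadsheet or `fgetcsv()` reads that line back as three columns, because the quoted field keeps its comma.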
php_n00b (Author) Posted February 22, 2011

How can I assign this to a variable?

    echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n";

If I can do that, then I should be able to get it to loop through the list of urls.

Many thanks
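The commas in that `echo` separate several arguments, which only `echo` accepts. To keep the text in a variable, the pieces are joined with the `.` operator instead. A small sketch, where `formatProduct` is a hypothetical name and `$product` is assumed to be one match from `preg_match_all(..., PREG_SET_ORDER)` (index 0 is the full match):

```php
<?php
// Hypothetical helper: build the output line as a string instead of echoing it.
function formatProduct(array $product): string
{
    return strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3];
}

// Collect the lines first, then write or echo them in one go.
$lines   = [];
$lines[] = formatProduct(['(full match)', 'Advantage for Large Cats and Rabbits', 'BAADV19', '14.46']);
$output  = implode("\n", $lines);
```

The same string could also be built with `sprintf()`. For CSV output it is usually simpler to keep the fields as an array and hand them to `fputcsv()`.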
silkfire Posted February 22, 2011

Wrong approach: cURL has to loop through the URLs, not the regex. Hmmm, I'm having trouble finding out why it isn't working with that comma. Could you sneak in another URL so I could improve the matching?
php_n00b (Author) Posted February 22, 2011

Are you getting the same results as me? It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.

Other urls would just be:

    http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25
    http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25
    etc

------------------------------------------------

Looping through URL list

Example line from CSV:

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

Text to grab:

    <div id="totalprice_div" style="font-size: medium;">Total Price: £0.26 </div>
    <title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>

Expected output (doesn't matter too much about order of columns):

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

My effort so far (regex is wrong though, I don't really get regex):

    <?php
    // basic setup
    $in = 'urls.csv';
    $out = 'myCSV.csv';
    $fpo = fopen($out, 'w');
    $fpi = fopen($in, 'r');
    if (!$fpi) die("$in BROKE");

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $curlurl);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    //read input
    while (!feof($fpi)) {
        $data = fgetcsv($fpi);
        $curlurl = $data [1]
        $products = curl_exec($ch);

        preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

        foreach($matches as &$matchgroup) {
            foreach($matchgroup as &$match)
                $match = strip_tags($match);
        }

        foreach($matches as $product)
        {
            //$csv_data [1] = `echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"`;
            //$csv_data [2] = $data [2]
            //fputcsv ($fpo,$csv_data)
        }
    //end while loop
    }
    ?>
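The regex in that attempt targets the other site's markup, so it cannot match the two snippets listed under "Text to grab". Below is a minimal sketch that pulls the title and price out of HTML shaped like those snippets. The function name `parseVioVetPage` is hypothetical, and the patterns assume the real pages match the sample exactly, which has not been checked against the live site.

```php
<?php
// Hypothetical helper: extract title and price from one product page.
// Returns ['title' => ..., 'price' => ...] or null if either part is missing.
function parseVioVetPage(string $html): ?array
{
    // Title: everything between <title> and </title>.
    if (!preg_match('~<title>(.*?)</title>~is', $html, $t)) {
        return null;
    }
    // Price: the number after "Total Price:" inside the totalprice_div.
    // The pound sign may arrive as an entity, as UTF-8 bytes, or as Latin-1.
    $pricePattern = '~id="totalprice_div"[^>]*>\s*Total Price:\s*'
                  . '(?:&pound;|&#163;|\xC2\xA3|\xA3)?\s*([0-9]+(?:\.[0-9]{1,2})?)~i';
    if (!preg_match($pricePattern, $html, $p)) {
        return null;
    }
    return ['title' => trim($t[1]), 'price' => $p[1]];
}
```

In the loop, `CURLOPT_URL` would have to be set on each pass before `curl_exec()`, and the URL is the first CSV column, so `$data[0]` rather than `$data[1]`. The result of `curl_exec($ch)` can then be passed to this function, and the returned values written out with `fputcsv()`.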
silkfire Posted February 22, 2011

You can choose another delimiter: use an unusual character like '|' (pipe character), for example. Is there such an option in the CSV function? I'll get on as soon as I have time, mate.

/s
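There is such an option: the third argument of `fputcsv()` is the delimiter, and `fgetcsv()` takes the same character when reading the file back. A short sketch with hypothetical helper names, using an in-memory stream so it runs without touching the disk:

```php
<?php
// Hypothetical helpers: write and read rows using '|' as the field delimiter.
function writePipeDelimited($fh, array $rows): void
{
    foreach ($rows as $row) {
        fputcsv($fh, $row, '|', '"', '\\');
    }
}

function readPipeDelimited($fh): array
{
    $rows = [];
    // Length 0 means "no line length limit".
    while (($row = fgetcsv($fh, 0, '|', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    return $rows;
}

$fh = fopen('php://memory', 'w+');
writePipeDelimited($fh, [['ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', 'BAADV08', '13.87']]);
rewind($fh);
$rows = readPipeDelimited($fh);
fclose($fh);
```

Whatever opens the file afterwards has to be told about the delimiter too. With the default comma the quoting already protects embedded commas, so a custom delimiter is mostly a matter of preference.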
silkfire Posted February 23, 2011

Hey, do you want the resulting CSV to be:

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

or

    ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

? I'm almost done with the script, matey.

php_n00b (Author) Posted February 23, 2011

I want both please, they're two different sites.

The bestpet one could be given a list of URLs, and it extracts all the data from each page (multiple matches, multiple CSV lines per URL).

The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).
silkfire Posted February 23, 2011

What do you mean? What do you want in the CSV? I'm having a hard time understanding what it is you want. I've excluded the pound sign from the price tag, by the way.

Please test this code. Make a text file called urls.txt with 1 url on each line. The script can take some time to execute because it doesn't read the sites in parallel but one by one instead (it's more advanced if you want it asynchronous).

    <div style="font-family: Arial">
    <?php
    // One URL per line in urls.txt
    $urls = file_get_contents('urls.txt');
    $urls = explode("\n", $urls);
    echo '<pre>', print_r($urls, true), '</pre>';

    // One cURL handle, reused for every URL
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $fh = fopen('products.csv', 'w');

    foreach($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $products = curl_exec($ch);

        // Capture groups: 1 = product name, 2 = SKU, 3 = price
        preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

        foreach($matches as &$matchgroup) {
            foreach($matchgroup as &$match)
                $match = strip_tags($match);
        }
        //echo '<pre>', print_r($matches, true), '</pre>';

        foreach($matches as &$product) {
            $product = array_slice($product, 1); // drop the full match, keep the 3 groups
            echo strtoupper($product[0]), ', -(', $product[1], '), £', $product[2], '<br>';
            fputcsv($fh, $product, ';');         // semicolon-delimited row
        }
    }
    fclose($fh);
    ?>
    </div>
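One thing to watch with `explode("\n", ...)` is that a trailing newline or Windows line endings in urls.txt leave empty or `\r`-terminated entries, which cURL would then try to fetch. A possible way to read the list more defensively is sketched below; `readUrlList` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: read one URL per line, ignoring blank lines and
// stray whitespace such as "\r" from Windows line endings.
function readUrlList(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // trim() removes leftover "\r" and spaces; array_filter() drops entries
    // that are empty after trimming; array_values() reindexes from 0.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}
```

`$urls = readUrlList('urls.txt');` could then stand in for the `file_get_contents()` and `explode()` pair in the script above.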
php_n00b (Author) Posted February 25, 2011

The bestpet one works just fine; it just needs to read its input from a CSV file of URLs, though. The output is fine:

    PRODUCT NAME, -(SKU), PRICE

Viovet needs to output like this to CSV:

    URL, SKU, PRODUCT NAME, PRICE
    http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26

and read from a list of urls like this:

    http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
    http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
    http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
    http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
    http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html

I've just looked and some pages do have multiple matches on them, e.g.

    http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
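In the example row the SKU (`p4211`) also appears as a path segment of the product URL, so one possible way to fill that column is to read it from the URL. This is an assumption based on that single example. The helper names and the `example.com` host below are placeholders, and category URLs carry no such segment.

```php
<?php
// Hypothetical helper: take the "p<digits>" path segment as the SKU.
// Returns null for URLs without one (for example category pages).
function skuFromUrl(string $url): ?string
{
    return preg_match('~/(p\d+)/~', $url, $m) ? $m[1] : null;
}

// Hypothetical helper: assemble one output row in the order
// URL, SKU, PRODUCT NAME, PRICE.
function vioVetRow(string $url, string $name, string $price): array
{
    return [$url, skuFromUrl($url) ?? '', $name, $price];
}

$row = vioVetRow(
    'http://example.com/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html',
    'ACP 10mg Tablets',
    '0.26'
);
// $row[1] is 'p4211'
```

A row built this way can go straight to `fputcsv()`. For category pages with several products, the SKU would have to come from the page markup instead, which the thread does not show.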
php_n00b (Author) Posted February 25, 2011

Thanks, that works fine for be.st.pet, but I just need to get a solution for viovet now.

silkfire Posted February 25, 2011

Why do you have viovet with so many periods? http://www.vio...v.e.t.co.uk

Do you want to run the same urls and spit the results out for bestpet first, then viovet? Is that two different CSVs, am I understanding this right?

php_n00b (Author) Posted February 25, 2011

Because I don't want this being found by the company in question (Google indexing etc.).

It's two different input/output files and two different regex matches, I believe.

silkfire Posted February 25, 2011

Are you afraid that viovet will find out you're indexing them? They'll see that when you load the page from their site...

php_n00b (Author) Posted February 28, 2011

@silkfire - not really. I'm only requesting one copy of each page, so it's hardly going to get noticed.

silkfire Posted February 28, 2011

I hope it all worked out.

php_n00b (Author) Posted February 28, 2011

Cheers Silkfire - did you get the viovet regex worked out for me?