
php_n00b

Members
  • Posts

    11
  • Joined

  • Last visited

    Never

Profile Information

  • Gender
    Not Telling

php_n00b's Achievements

Newbie

Newbie (1/5)

0

Reputation

  1. Cheers Silkfire - did you get the viovet regex worked out for me?
  2. @silkfire - not really. I'm only requesting one copy of each page, so it's hardly going to get noticed.
  3. Because I don't want this being found by the company in question (Google indexing etc.). It's two different input/output files and two different regex matches, I believe.
  4. Thanks - that works fine for be.st.pet but just need to get a solution for viovet now.
  5. The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:
     PRODUCT NAME, -(SKU), PRICE
     Viovet needs to output to CSV like this:
     Output: URL, SKU, PRODUCT NAME, PRICE
     http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26
     and read from a list of URLs like this:
     http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
     http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
     http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
     http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
     http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html
     I've just looked and some pages do have multiple matches on them, e.g.
     http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
  6. I want both, please - they're two different sites.
     The bestpet one could be given a list of URLs and it extracts all the data from the page (multiple matches, multiple CSV lines per URL).
     The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).
  7. Are you getting the same results as me? It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.
     Other URLs would just be:
     http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25
     http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25
     etc.
     ------------------------------------------------
     Looping through URL list
     Example line from CSV:
     http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211
     Text to grab:
     <div id="totalprice_div" style="font-size: medium;">Total Price: £0.26 </div>
     <title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>
     Expected output (doesn't matter too much about order of columns):
     http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26
     My effort so far (regex is wrong though - I don't really get regex):
         <?php
         // basic setup
         $in = 'urls.csv';
         $out = 'myCSV.csv';
         $fpo = fopen($out, 'w');
         $fpi = fopen($in, 'r');
         if (!$fpi) die("$in BROKE");
         $ch = curl_init();
         curl_setopt($ch, CURLOPT_URL, $curlurl);
         curl_setopt($ch, CURLOPT_AUTOREFERER, true);
         curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
         //read input
         while (!feof($fpi)) {
             $data = fgetcsv($fpi);
             $curlurl = $data [1]
             $products = curl_exec($ch);
             preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);
             foreach($matches as &$matchgroup) {
                 foreach($matchgroup as &$match)
                     $match = strip_tags($match);
             }
             foreach($matches as $product) {
                 //$csv_data [1] = `echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"`;
                 //$csv_data [2] = $data [2]
                 //fputcsv ($fpo,$csv_data)
             }
         //end while loop
         }
         ?>
  8. How can I assign this to a variable? echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"; If I can do that, then I should be able to get it to loop through the list of urls. Many thanks
  9. Hi Silkfire.
     ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46
     ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87
                         ^ comma was here
     "Advantage Small Cats, Small Dogs & Pet Rabbits"
  10. Is loadXML generally a rubbish method or just the way I was using it? That works fine, except where there is a comma (,) in the actual DIV, how would you extract the comma? Otherwise the CSV file generally gets confused. I haven't worked out how to do this yet, but it does save a CSV file properly now. Now I have a text file of URLs I want. So I could just set that up as a loop setting CURLOPT_URL to the variable grabbed from the file each time? If I do this, I have a feeling it would prompt me to save a file each iteration, so can I just spit strtoupper into an array and then echo it at the end to get the CSV file?
  11. I am parsing (screen-scraping, if you will) a website and need to extract the following:
      <div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td> </tr></tbody></table></div>
      There are multiple instances of the DIV like this (same class ID).
      What I'd like is CSV output like this:
      PRODUCT NAME, -(SKU), PRICE
      What I've tried and does not work:
          <?php
          /**
           *
           * @get text between tags
           *
           * @param string $tag The tag name
           *
           * @param string $html The XML or XHTML string
           *
           * @param int $strict Whether to use strict mode
           *
           * @return array
           *
           */
          function getTextBetweenTags($tag, $html, $strict=0)
          {
              /*** a new dom object ***/
              $dom = new domDocument;
              /*** load the html into the object ***/
              if($strict==1) {
                  $dom->loadXML($html);
              } else {
                  $dom->loadHTML($html);
              }
              /*** discard white space ***/
              $dom->preserveWhiteSpace = false;
              /*** the tag by its tag name ***/
              $content = $dom->getElementsByTagname($tag);
              /*** the array to return ***/
              $out = array();
              foreach ($content as $item) {
                  /*** add node value to the out array ***/
                  $out[] = $item->nodeValue;
              }
              /*** return the results ***/
              return $out;
          }
          function getTags( $dom, $tagName, $attrName, $attrValue ){
              $html = '';
              $domxpath = new DOMXPath($dom);
              $newDom = new DOMDocument;
              $newDom->formatOutput = true;
              //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
              // $filtered = $domxpath->query('//div[@class="className"]');
              // '//' when you don't know 'absolute' path
              $filtered = $domxpath->query("*/div[@id='resprodtop']");
              // since above returns DomNodeList Object
              // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
              $i = 0;
              while( $myItem = $filtered->item($i++) ){
                  $node = $newDom->importNode( $myItem, true ); // import node
                  $newDom->appendChild($node); // append node
              }
              $html = $newDom->saveHTML();
              return $html;
          }
          $some_link = 'http://www...';
          $tagname = 'div';
          $attrName = 'class';
          $attrValue = 'resprodtop';
          $dom = new DOMDocument;
          $dom->preserveWhiteSpace = false;
          @$dom->loadHTMLFile($some_link);
          //If using domxpath
          //getTags( $dom, $tagName, $attrName, $attrValue );
          //echo $html;
          //If using gettextbetweentags
          $string = file_get_contents('http://www...');
          $content = getTextBetweenTags('div class="resprodtop"', $string, 1);
          foreach( $content as $item ) {
              echo $item.'<br />';
          }
          ///
          ?>
      With the domxpath I get no output at all, I believe. But with the above code commented I get a list of errors like:
      Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24
      Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24
      Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
      Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
      Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
      Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
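
For anyone following the posts above, here is a minimal sketch of one way the "resprodtop" extraction and the comma problem could be handled together. It is not the solution given in the thread. The function names extractProducts and rowsToCsv are hypothetical, the sample markup is trimmed from the snippet in post 11 and filled with the product name from post 9, and it assumes the price is written as &pound; in the page source (as the regex in post 7 suggests) and that the input is already UTF-8. It uses loadHTML rather than loadXML, since loadXML expects well-formed XML and does not know HTML entities such as &nbsp;, and it lets fputcsv quote any field that contains a comma. It needs PHP 7.4 or later because of the empty escape argument.

```php
<?php
// Hypothetical helper (not from the thread): returns one [name, sku, price]
// array per <div class="resprodtop"> found in $html.
function extractProducts(string $html): array
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    // loadHTML() tolerates sloppy markup. The prefix asks libxml to read the
    // input as UTF-8 (an assumption; convert other encodings first).
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows  = [];
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = trim($xpath->evaluate('string(.//a[@class="productname"])', $div));
        $sku   = $xpath->evaluate('string(.//span[@class="mainblack"]/strong)', $div);
        $price = trim($xpath->evaluate('string(.//td[@class="productname"])', $div));
        // Keep only the code inside the parentheses: " - (BAADV08)" becomes "BAADV08".
        $sku = preg_match('/\(([^)]+)\)/', $sku, $m) ? $m[1] : trim($sku);
        $rows[] = [strtoupper($name), "-($sku)", $price];
    }
    return $rows;
}

// Hypothetical helper: builds the CSV text in memory and returns it as a
// string, so it can be assigned to a variable instead of echoed.
function rowsToCsv(array $rows): string
{
    $fp = fopen('php://temp', 'r+');
    foreach ($rows as $row) {
        // fputcsv() wraps fields containing commas or spaces in double quotes.
        fputcsv($fp, $row, ',', '"', '');
    }
    rewind($fp);
    $csv = stream_get_contents($fp);
    fclose($fp);
    return $csv;
}

// Sample markup: structure from post 11, product data from post 9.
$sample = '<div class="resprodtop"><table><tbody><tr>'
    . '<td><a href="#" class="productname">Advantage Small Cats, Small Dogs &amp; Pet Rabbits</a>'
    . '<span class="mainblack"><strong> - (BAADV08)</strong></span></td>'
    . '<td class="maintext">Price:</td>'
    . '<td class="productname"><div id="message0">&pound;13.87</div></td>'
    . '</tr></tbody></table></div>';

$rows = extractProducts($sample);
$csv  = rowsToCsv($rows);
echo $csv;
```

To loop over a list of URLs, the same two functions could be called once per fetched page, appending each page's rows to one array before calling rowsToCsv at the end. Because the name field is quoted, the comma in "Small Cats, Small Dogs" stays inside a single column when the file is opened as CSV.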