php_n00b
Posts posted by php_n00b
-
-
@silkfire - not really. I'm only requesting one copy of each page, so it's hardly going to get noticed.
-
Because I don't want this to be found by the company in question (Google indexing etc.).
It's two different input/output files and two different regex matches I believe.
-
Thanks - that works fine for be.st.pet but just need to get a solution for viovet now.
-
The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:
PRODUCT NAME, -(SKU), PRICE
Viovet needs to output to CSV like this:
Output:
URL, SKU, PRODUCT NAME, PRICE
http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26
and read from a list of URLs like this:
http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html
I've just looked and some pages do have multiple matches on them e.g.
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
-
I want both, please - they're two different sites.
The bestpet one could be given a list of URLs, and it extracts all the data from each page (multiple matches, multiple CSV lines per URL).
The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).
-
Are you getting the same results as me?
It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.
Other urls would just be:
etc
------------------------------------------------
Looping through URL list
Example line from CSV
http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211
Text to grab:
<div id="totalprice_div" style="font-size: medium;">Total Price: £0.26 </div>
<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>
Expected output (doesn't matter too much about order of columns)
http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26
My effort so far (the regex is wrong though - I don't really get regex):
<?php
// basic setup
$in  = 'urls.csv';
$out = 'myCSV.csv';
$fpo = fopen($out, 'w');
$fpi = fopen($in, 'r');
if (!$fpi) die("$in BROKE");

$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// read input, one CSV row (URL, SKU) at a time
while (($data = fgetcsv($fpi)) !== false) {
    $curlurl = $data[0]; // PHP arrays are zero-based: column 0 is the URL, column 1 the SKU
    curl_setopt($ch, CURLOPT_URL, $curlurl); // set the URL inside the loop, once it is known
    $products = curl_exec($ch);

    preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

    foreach ($matches as &$matchgroup) {
        foreach ($matchgroup as &$match)
            $match = strip_tags($match);
    }

    foreach ($matches as $product) {
        //$csv_data [1] = `echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"`;
        //$csv_data [2] = $data [2]
        //fputcsv ($fpo,$csv_data)
    }
} // end while loop
?>
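For the one-item-per-page case, a minimal sketch of the extraction step might look like the following. It only targets the two snippets quoted under "Text to grab", and it assumes the page is served as UTF-8 and that the markup is exactly as quoted. The function name extractViovetFields is made up for illustration; it is not part of any library.

```php
<?php
// Hypothetical helper: pull the <title> text and the "Total Price" value
// out of one page of HTML. Returns array(name, price); either may be null.
function extractViovetFields($html)
{
    $name = null;
    $price = null;
    if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
        $name = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    if (preg_match('#id="totalprice_div"[^>]*>\s*Total Price:\s*([^<]+?)\s*</div>#i', $html, $m)) {
        $price = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return array($name, $price);
}

// Sample markup copied from the "Text to grab" lines.
$html = '<div id="totalprice_div" style="font-size: medium;">Total Price: £0.26 </div>'
      . '<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>';

list($name, $price) = extractViovetFields($html);

// One CSV row per URL: URL and SKU come from the input file, the rest from the page.
$row = array(
    'http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html',
    'p4211',
    $name,
    $price,
);
print_r($row);
```

In the loop, $html would be the string returned by curl_exec(), and $row would be passed to fputcsv() rather than printed.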
-
How can I assign this to a variable?
echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n";
If I can do that, then I should be able to get it to loop through the list of urls.
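A short sketch of one way to do this: echo accepts a comma-separated list of arguments, but an assignment needs a single string, so the pieces can be joined with the "." operator or built with sprintf(). The $product values below are hypothetical sample data shaped like one match from preg_match_all() with PREG_SET_ORDER.

```php
<?php
// Hypothetical sample match: index 0 is the full match, 1-3 the captured groups.
$product = array('', 'Advantage for Large Cats and Rabbits', 'BAADV19', '14.46');

// Same pieces as the echo statement, joined with "." so they form one string.
$line = strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3] . "\n";

// sprintf() builds the same string from a format template.
$same = sprintf("%s, -(%s), £%s\n", strtoupper($product[1]), $product[2], $product[3]);

echo $line;
```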
Many thanks
-
Hi Silkfire.
ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46
ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87
^ comma was here
"Advantage Small Cats, Small Dogs & Pet Rabbits"
-
Is loadXML generally a rubbish method or just the way I was using it?
That works fine, except where there is a comma (,) in the actual DIV, how would you extract the comma? Otherwise the CSV file generally gets confused.
I haven't worked out how to do this yet, but it does save a CSV file properly now.
Now I have a text file of URLs I want. So I could just set that up as a loop setting CURLOPT_URL to the variable grabbed from the file each time?
If I do this, I have a feeling it would prompt me to save a file each iteration, so can I just spit strtoupper into an array and then echo it at the end to get the CSV file?
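One possible approach, sketched under assumptions: collect each product as an array of fields and let fputcsv() write the rows once at the end. fputcsv() wraps a field in double quotes when it contains the delimiter, so a comma inside a product name should stay in one column when the file is read back by a CSV-aware reader. The rows below are hypothetical sample data, and php://memory stands in for the real output file so the sketch runs without touching disk.

```php
<?php
// Hypothetical rows gathered while looping over the URLs;
// the first product name contains a comma.
$rows = array(
    array('ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', '-(BAADV08)', '£13.87'),
    array('ADVANTAGE FOR LARGE CATS AND RABBITS', '-(BAADV19)', '£14.46'),
);

// A real script would use fopen('myCSV.csv', 'w') here instead.
$fp = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    // Delimiter, enclosure and escape character are passed explicitly.
    fputcsv($fp, $row, ',', '"', '\\');
}

// Read the CSV back to check that the comma did not split the first field.
rewind($fp);
$readBack = array();
while (($fields = fgetcsv($fp, 0, ',', '"', '\\')) !== false) {
    $readBack[] = $fields;
}
fclose($fp);

print_r($readBack);
```

Writing to a file handle on the server this way does not trigger a browser download prompt; that only happens if the script sends the CSV to the browser with download headers.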
-
I am parsing (screen-scraping, if you will) a website and need to extract the following:
<div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td> </tr></tbody></table></div>
There are multiple instances of the DIV like this (same class).
What I'd like is CSV output like this
PRODUCT NAME, -(SKU), PRICE
What I've tried, which does not work:
<?php
/**
 *
 * @get text between tags
 *
 * @param string $tag The tag name
 *
 * @param string $html The XML or XHTML string
 *
 * @param int $strict Whether to use strict mode
 *
 * @return array
 *
 */
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;

    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}

function getTags( $dom, $tagName, $attrName, $attrValue ){
    $html = '';
    $domxpath = new DOMXPath($dom);
    $newDom = new DOMDocument;
    $newDom->formatOutput = true;

    //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
    // $filtered = $domxpath->query('//div[@class="className"]');
    // '//' when you don't know 'absolute' path
    $filtered = $domxpath->query("*/div[@id='resprodtop']");

    // since above returns DomNodeList Object
    // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $node = $newDom->importNode( $myItem, true ); // import node
        $newDom->appendChild($node); // append node
    }
    $html = $newDom->saveHTML();
    return $html;
}

$some_link = 'http://www...';
$tagname = 'div';
$attrName = 'class';
$attrValue = 'resprodtop';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

//If using domxpath
//getTags( $dom, $tagName, $attrName, $attrValue );
//echo $html;

//If using gettextbetweentags
$string = file_get_contents('http://www...');
$content = getTextBetweenTags('div class="resprodtop"', $string, 1);
foreach( $content as $item )
{
    echo $item.'<br />';
}
///
?>
With the domxpath I get no output at all, I believe. But with the above code commented I get a list of errors like:
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
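The warnings come from loadXML(), which expects well-formed XML; loadHTML() is the tolerant parser meant for real-world HTML. As a rough sketch of the DOM route (not necessarily the best fit for the live page), the following runs an XPath query by class against a trimmed copy of the DIV quoted earlier. The sample string and the prepended meta charset tag are assumptions for the demo. The live page appears to contain a raw 0xA3 byte, so it may need a different charset.

```php
<?php
// Sample markup: the resprodtop DIV from the post, trimmed to the cells that matter.
$html = '<div class="resprodtop"><table><tbody><tr>'
      . '<td><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a>'
      . '<span class="mainblack"><strong> - (SKU)</strong></span></td>'
      . '<td class="maintext">Price:</td>'
      . '<td class="productname"><div id="message0">£19.15</div></td>'
      . '</tr></tbody></table></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
// Assumption: the sample string is UTF-8, so the parser is told via a meta tag.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = array();
// '//' searches the whole document; @class (not @id) matches the attribute in the markup.
foreach ($xpath->query("//div[@class='resprodtop']") as $div) {
    $name  = $xpath->query(".//a[@class='productname']", $div)->item(0);
    $sku   = $xpath->query(".//span[@class='mainblack']/strong", $div)->item(0);
    $price = $xpath->query(".//td[@class='productname']/div", $div)->item(0);
    if ($name && $sku && $price) {
        $rows[] = array(
            strtoupper(trim($name->textContent)),
            trim($sku->textContent),
            trim($price->textContent),
        );
    }
}
print_r($rows);
```

Each entry in $rows is one product, so a page with several matching DIVs yields several CSV lines.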
Extract data between two tags
in Regex Help
Posted
Cheers Silkfire - did you get the viovet regex worked out for me?