
php_n00b


Posts posted by php_n00b

  1. The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:

     PRODUCT NAME, -(SKU), PRICE 

     

     

    The Viovet one needs to write its output to CSV like this:

     

    Output:

    URL, SKU, PRODUCT NAME, PRICE
    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26
    

     

    and read from a list of URLs like this:

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
    http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
    http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
    http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
    http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html
    
    

     

    I've just looked and some pages do have multiple matches on them e.g.

    http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
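
In the example row above, the SKU (p4211) is also the first path segment of the product URL, so one option is to derive it from the URL instead of storing it separately. A minimal sketch: `skuFromUrl` is a hypothetical helper name, and the assumption that product URLs start with `/p<digits>/` rests only on the single example shown. Category URLs return null.

```php
<?php
// Hypothetical helper: pull the product code (e.g. "p4211") out of a product URL.
// Assumption: the SKU is the first path segment, as in the one example URL.
function skuFromUrl(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // malformed URL or no path component
    }
    // Match a leading "/p<digits>/" segment; category URLs will not match.
    if (preg_match('#^/(p\d+)/#', $path, $m)) {
        return $m[1];
    }
    return null;
}
```

A null result could be used to skip category pages, or to route them to a multi-match extractor.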

  2. I want both, please. They're two different sites  :shy:

     

    The bestpet one could be given a list of URLs and it extracts all the data from the page (multiple matches, multiple CSV lines per URL)

    The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).

  3. Are you getting the same results as me?

    It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.

    Other urls would just be:

    http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25

    http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25

    etc

    ------------------------------------------------

    Looping through URL list

     

    Example line from CSV

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

    Text to grab:

    <div id="totalprice_div" style="font-size: medium;">Total Price: 
    		£0.26			</div>
    

     

    <title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>
    

     

    Expected output (doesn't matter too much about order of columns)

     

    http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26
    

     

     

    My effort so far (the regex is wrong though, I don't really get regex):

    <?php
    
    // basic setup
    $in  = 'urls.csv';
    $out = 'myCSV.csv';
    $fpi = fopen($in, 'r');
    if (!$fpi) die("$in BROKE");
    $fpo = fopen($out, 'w');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    
    // read input: each row is "URL, SKU"
    while (($data = fgetcsv($fpi)) !== false)
    {
        if (empty($data[0])) continue;             // skip blank lines
        $curlurl = trim($data[0]);                 // column 0 is the URL, column 1 the SKU
        curl_setopt($ch, CURLOPT_URL, $curlurl);   // the URL has to be set before every request
        $products = curl_exec($ch);
        if ($products === false) continue;         // request failed, move on
    
        // pattern carried over unchanged; it does not match the title/price markup shown above
        preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);
    
        foreach ($matches as &$matchgroup) {
            foreach ($matchgroup as &$match)
                $match = strip_tags($match);
        }
        unset($matchgroup, $match);                // drop the references left over from the loops
    
        foreach ($matches as $product) {
            //$csv_data = array($curlurl, trim($data[1]), strtoupper($product[1]), '£' . $product[3]);
            //fputcsv($fpo, $csv_data);
        }
    //end while loop
    }
    
    fclose($fpi);
    fclose($fpo);
    ?>
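
For the two snippets quoted under "Text to grab", a pair of simpler patterns may be enough. This is only a sketch: `extractViovetFields` is a hypothetical name, and the patterns assume the `<title>` and `totalprice_div` markup exactly as pasted above, so a different page layout would need different patterns.

```php
<?php
// Sketch: extract the <title> text and the "Total Price" value from one product page.
// Returns null when either piece of markup is missing.
function extractViovetFields(string $html): ?array
{
    // Title: everything between <title> and </title>.
    if (!preg_match('#<title>(.*?)</title>#is', $html, $t)) {
        return null;
    }
    // Price: the text after "Total Price:" inside the totalprice_div, up to </div>.
    $pricePattern = '#<div id="totalprice_div"[^>]*>\s*Total Price:\s*(.*?)\s*</div>#is';
    if (!preg_match($pricePattern, $html, $p)) {
        return null;
    }
    return ['name' => trim($t[1]), 'price' => trim($p[1])];
}
```

The returned values could then be combined with the URL and SKU columns from the input row and passed to fputcsv() as one array per page.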
    
    

     

     

  4. Hi Silkfire.

     

    ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

    ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87

                                          ^ comma was here

     

    "Advantage Small Cats, Small Dogs & Pet Rabbits"
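
One way to keep the comma without breaking the columns is to let fputcsv() build each line, since it wraps any field containing the delimiter in double quotes. A small sketch using an in-memory stream; `csvLine` is a hypothetical helper name used only for illustration.

```php
<?php
// Sketch: build one CSV line with fputcsv() so that commas inside a field are quoted.
function csvLine(array $fields): string
{
    $fp = fopen('php://memory', 'w+');      // temporary in-memory stream
    fputcsv($fp, $fields, ',', '"', '\\');  // delimiter, enclosure, escape character
    rewind($fp);
    $line = stream_get_contents($fp);
    fclose($fp);
    return rtrim($line, "\n");
}

// The product name keeps its comma and still occupies a single column.
echo csvLine(['Advantage Small Cats, Small Dogs & Pet Rabbits', 'BAADV08', '£13.87']), "\n";
```

Reading the line back with fgetcsv() or str_getcsv() gives three fields again, with the comma still inside the first one.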

     

                                                                   

  5. Is loadXML generally a rubbish method or just the way I was using it?

     

    That works fine, except where there is a comma (,) in the actual DIV text. How would you handle the comma? Otherwise the CSV file gets confused.

     

    I haven't worked out how to do this yet, but it does save a CSV file properly now.

     

     

    Now I have a text file of the URLs I want. So could I just set that up as a loop, setting CURLOPT_URL to the value read from the file each time?

    If I do this, I have a feeling it would prompt me to save a file on each iteration, so can I just push the strtoupper output into an array and then echo it all at the end to get the CSV file?
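
A rough sketch of that loop, assuming one URL per line in the text file. One cURL handle is reused with CURLOPT_URL changed on each pass, rows are collected in an array, and the CSV is written once at the end. All function names here are hypothetical, and `$extract` stands for whatever callback turns a page into CSV rows.

```php
<?php
// Read one URL per line, ignoring blank lines and surrounding whitespace.
function readUrlList(string $file): array
{
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    $lines = array_map('trim', $lines);
    return array_values(array_filter($lines, function ($l) { return $l !== ''; }));
}

// Write all collected rows to disk in one go.
function writeRows(string $file, array $rows): void
{
    $fp = fopen($file, 'w');
    foreach ($rows as $row) {
        fputcsv($fp, $row, ',', '"', '\\');
    }
    fclose($fp);
}

// $extract is a callback: function (string $url, string $html): array of rows.
function scrapeToCsv(string $urlFile, string $csvFile, callable $extract): void
{
    $ch = curl_init();
    // Return the page as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $rows = [];
    foreach (readUrlList($urlFile) as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);   // same handle, new URL each pass
        $html = curl_exec($ch);
        if ($html === false) {
            continue;                           // skip pages that fail to load
        }
        foreach ($extract($url, $html) as $row) {
            $rows[] = $row;
        }
    }
    writeRows($csvFile, $rows);
}
```

Because the rows go to a file on the server rather than being echoed with download headers, nothing should prompt for a save on each iteration.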

     

     

  6. I am parsing (screen-scraping, if you will) a website and need to extract the following:

     

    <div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
                </tr></tbody></table></div>
    

     

    There are multiple instances of this DIV on the page (all with the same class).

    What I'd like is CSV output like this:

    PRODUCT NAME, -(SKU), PRICE
    

     

     

    What I've tried, which does not work:

    <?php
    
    /**
    *
    * @get text between tags
    *
    * @param string $tag The tag name
    *
    * @param string $html The XML or XHTML string
    *
    * @param int $strict Whether to use strict mode
    *
    * @return array
    *
    */
    function getTextBetweenTags($tag, $html, $strict=0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;
    
        /*** load the html into the object ***/
        if($strict==1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }
    
        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;
    
        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);
    
        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
        /*** return the results ***/
        return $out;
    }
    
    
    function getTags( $dom, $tagName, $attrName, $attrValue ){
        $html = '';
        $domxpath = new DOMXPath($dom);
        $newDom = new DOMDocument;
        $newDom->formatOutput = true;
    
        // $filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
        // $filtered = $domxpath->query('//div[@class="className"]');
        // '//' when you don't know 'absolute' path
        $filtered = $domxpath->query("*/div[@id='resprodtop']");
    
        // since above returns DomNodeList Object
        // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
        $i = 0;
        while( $myItem = $filtered->item($i++) ){
            $node = $newDom->importNode( $myItem, true );    // import node
            $newDom->appendChild($node);                    // append node
        }
        $html = $newDom->saveHTML();
        return $html;
    }
    
    
    $some_link = 'http://www...';
    $tagname = 'div';
    $attrName = 'class';
    $attrValue = 'resprodtop';
    
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTMLFile($some_link);
    
    //If using domxpath
    
    //getTags( $dom, $tagName, $attrName, $attrValue );
    //echo $html;
    
    //If using gettextbetweentags
    
    
    $string = file_get_contents('http://www...');
    
    $content = getTextBetweenTags('div class="resprodtop"', $string, 1);
    
    foreach( $content as $item )
    {
        echo $item.'<br />';
    }
    ///
    
    
    ?>

     

    With the DOMXPath version I get no output at all, I believe. With that call commented out, as in the code above, I get a list of warnings like:

    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24
    
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24
    
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
    
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
    
    Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

     

    Page is here: http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
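
For comparison, a sketch of the same extraction using loadHTML() and an XPath query on the class attribute. The warnings above come from loadXML(), which expects well-formed XML and rejects things like `&nbsp;`. Also note that the XPath in the code above tests `@id`, while the DIV has a `class`. The function name `extractProducts` is hypothetical, and the inner queries assume the markup exactly as pasted. The `<?xml encoding="UTF-8">` prefix is a commonly used workaround to make libxml read the input as UTF-8. The "not proper UTF-8 ... 0xA3" warning suggests the live page may be ISO-8859-1, in which case it would need converting first.

```php
<?php
// Sketch: return one [name, sku, price] row per <div class="resprodtop">.
function extractProducts(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);    // collect markup warnings instead of printing them
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    // '//' searches the whole document; the attribute tested is class, not id.
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = $xpath->query('.//a[@class="productname"]', $div)->item(0);
        $sku   = $xpath->query('.//span[@class="mainblack"]', $div)->item(0);
        $price = $xpath->query('.//td[@class="productname"]', $div)->item(0);
        if (!$name || !$sku || !$price) {
            continue;                     // skip blocks that lack one of the parts
        }
        $rows[] = [
            strtoupper(trim($name->textContent)),
            trim($sku->textContent, " \t\n\r-()"),  // " - (SKU)" becomes "SKU"
            trim($price->textContent),
        ];
    }
    return $rows;
}
```

Each returned row can be passed straight to fputcsv(), which also takes care of product names that contain commas.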

     

     

     
