php_n00b

February 28, 2011

Cheers Silkfire - did you get the viovet regex worked out for me?

February 28, 2011

@silkfire - not really. I'm only requesting one copy of each page, so it's hardly going to get noticed.

February 25, 2011

Because I don't want this being found by the company in question (Google indexing etc.).

It's two different input/output files and two different regex matches I believe.

February 25, 2011

Thanks - that works fine for be.st.pet, but I just need to get a solution for viovet now.

February 25, 2011

The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:

 PRODUCT NAME, -(SKU), PRICE

Viovet needs to output to CSV like this:

Output:

URL, SKU, PRODUCT NAME, PRICE
http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26

and read from a list of URLs like this:

http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html

I've just looked, and some pages do have multiple matches on them, e.g.:

http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html

February 23, 2011

I want both, please - they're two different sites :shy:

The bestpet one could be given a list of URLs and it extracts all the data from the page (multiple matches, multiple CSV lines per URL)

The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).

February 22, 2011

Are you getting the same results as me?

It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.

Other URLs would just be:

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25

etc
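Since the result pages above differ only in start_record, the list could be generated rather than typed out. A minimal sketch, assuming 25 results per page as in the URLs above; the helper name buildSearchUrls and the page count of 3 are made up for illustration:

```php
<?php
// Hypothetical helper: build the paginated search URLs by stepping start_record.
function buildSearchUrls($pages, $perPage = 25)
{
    $base = 'http://www.bestpetpharmacy.co.uk/search_results.asp';
    $urls = array();
    for ($page = 0; $page < $pages; $page++) {
        // Empty values are kept as "sort=&sec=", matching the URLs in the post
        $query = http_build_query(array(
            'search_type'  => 'free_any',
            'start_record' => $page * $perPage,
            'search'       => 'prescription',
            'sort'         => '',
            'sec'          => '',
            'snum'         => $perPage,
        ), '', '&');
        $urls[] = $base . '?' . $query;
    }
    return $urls;
}

// The number of pages is an assumption; adjust it to the real result count.
$urls = buildSearchUrls(3);
foreach ($urls as $url) {
    echo $url, "\n";
}
```

Each generated URL can then be written to the input file, one per line.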

------------------------------------------------

Looping through URL list

Example line from CSV

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

Text to grab:

<div id="totalprice_div" style="font-size: medium;">Total Price: 
		£0.26			</div>

<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>

Expected output (doesn't matter too much about order of columns)

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

My effort so far (the regex is wrong though - I don't really get regex):

<?php

// basic setup
$in  = 'urls.csv';
$out = 'myCSV.csv';
$fpo = fopen($out, 'w');
$fpi = fopen($in, 'r');
if (!$fpi) die("$in BROKE");
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// read input: one CSV row (URL, SKU) per iteration
while (($data = fgetcsv($fpi)) !== false)
{
    if (empty($data[0])) continue;            // skip blank lines
    $curlurl = trim($data[0]);                // column 0 is the URL, column 1 the SKU
    curl_setopt($ch, CURLOPT_URL, $curlurl);  // the URL can only be set once it is known
    $products = curl_exec($ch);

    // TODO: this pattern does not match the viovet markup yet
    preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

    foreach ($matches as &$matchgroup) {
        foreach ($matchgroup as &$match)
            $match = strip_tags($match);
        unset($match);                        // drop the dangling reference
    }
    unset($matchgroup);

    foreach ($matches as $product) {
        //$csv_data[0] = strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3];
        //$csv_data[1] = $data[1];
        //fputcsv($fpo, $csv_data);
    }
// end while loop
}

fclose($fpi);
fclose($fpo);

?>
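For the viovet pages, the two snippets under "Text to grab" can each be matched with their own small pattern. This is only a sketch tested against the sample markup quoted in this post, so it assumes the real pages look the same; the helper name extractViovetProduct is made up.

```php
<?php
// Hypothetical helper: pull the <title> text and the total price out of one product page.
function extractViovetProduct($html)
{
    if (!preg_match('#<title>(.*?)</title>#is', $html, $t)) {
        return null;
    }
    // Capture whatever sits between "Total Price:" and the closing </div>, minus whitespace
    if (!preg_match('#<div id="totalprice_div"[^>]*>\s*Total Price:\s*(.*?)\s*</div>#is', $html, $p)) {
        return null;
    }
    return array('name' => trim($t[1]), 'price' => trim($p[1]));
}

// Sample markup copied from the post
$sample = "<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>\n"
        . "<div id=\"totalprice_div\" style=\"font-size: medium;\">Total Price: \n\t\t£0.26\t\t\t</div>";

$row = extractViovetProduct($sample);

// URL and SKU come from the input CSV line; name and price from the page
$fp = fopen('php://output', 'w');
fputcsv($fp, array(
    'http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html',
    'p4211',
    $row['name'],
    $row['price'],
));
fclose($fp);
```

In the loop above, $sample would be replaced by the string returned from curl_exec().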

February 22, 2011

How can I assign this to a variable?

		echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n";

If I can do that, then I should be able to get it to loop through the list of urls.
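A short sketch of one way to do it: echo accepts a comma-separated list, but an assignment needs a single string, so the pieces are joined with the "." operator (or built with sprintf()). The $product values below are stand-ins for one regex match, not real scraped data.

```php
<?php
// Stand-in for one regex match (values are illustrative only)
$product = array(1 => 'Advantage for Large Cats and Rabbits', 2 => 'BAADV19', 3 => '14.46');

// Join the pieces with "." instead of passing them to echo with commas
$line = strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3] . "\n";

// sprintf() builds the same string from a template
$same = sprintf("%s, -(%s), £%s\n", strtoupper($product[1]), $product[2], $product[3]);

echo $line;
```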

Many thanks

February 22, 2011

Hi Silkfire.

ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87

^ comma was here

"Advantage Small Cats, Small Dogs & Pet Rabbits"

February 21, 2011

Is loadXML generally a rubbish method or just the way I was using it?

That works fine, except where there is a comma (,) in the actual DIV. How would you strip or escape the comma? Otherwise the CSV file gets confused.
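One option, sketched here, is to leave the comma alone and let fputcsv() write the row, since it wraps any field containing the delimiter in double quotes. The row below reuses the product name from this thread; php://memory just keeps the demo off the disk.

```php
<?php
// One row whose first field contains a comma
$row = array('ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', '-(BAADV08)', '£13.87');

// fputcsv() quotes the field with the comma, so it stays one column
$fp = fopen('php://memory', 'w+');
fputcsv($fp, $row);
rewind($fp);
$csvLine = stream_get_contents($fp);
fclose($fp);

echo $csvLine;

// Reading the line back gives three fields again, not four
$fields = str_getcsv(trim($csvLine));
```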

I haven't worked out how to do this yet, but it does save a CSV file properly now.

Now I have a text file of URLs I want. So I could just set that up as a loop setting CURLOPT_URL to the variable grabbed from the file each time?

If I do this, I have a feeling it would prompt me to save a file each iteration, so can I just spit strtoupper into an array and then echo it at the end to get the CSV file?
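Collecting the rows in an array and writing them once at the end should work. A rough sketch of that idea follows; writeCsv is a made-up helper name, the two rows stand in for scraped matches, and the temp file is only there so the demo runs anywhere.

```php
<?php
// Hypothetical helper: write all collected rows to a CSV file in one go.
function writeCsv($path, array $rows)
{
    $fp = fopen($path, 'w');
    if (!$fp) {
        return false;
    }
    foreach ($rows as $row) {
        fputcsv($fp, $row);
    }
    fclose($fp);
    return true;
}

$rows = array();
// Inside the URL loop, each match would be appended here instead of echoed
$rows[] = array(strtoupper('Advantage for Large Cats and Rabbits'), '-(BAADV19)', '£14.46');
$rows[] = array(strtoupper('Advantage Small Cats, Small Dogs & Pet Rabbits'), '-(BAADV08)', '£13.87');

$path = tempnam(sys_get_temp_dir(), 'csv'); // demo path; use your own file name
writeCsv($path, $rows);
echo file_get_contents($path);
```

Writing to a file on the server this way avoids any download prompt, because nothing is sent to the browser as an attachment.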

February 15, 2011

I am parsing (screen-scraping, if you will) a website and need to extract the following:

<div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
            </tr></tbody></table></div>

There are multiple instances of the DIV like this (same class).

What I'd like is CSV output like this

PRODUCT NAME, -(SKU), PRICE
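For reference, a rough sketch of how that output could be produced with DOMDocument::loadHTML() and DOMXPath. It assumes every result block has the same structure as the DIV above; extractProducts is a made-up name, and the sample is a cut-down copy of that DIV with illustrative values and the price written as the &pound; entity.

```php
<?php
// Hypothetical helper: turn every div.resprodtop block into array(name, sku, price).
function extractProducts($html)
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument;
    $dom->loadHTML($html);              // lenient HTML parser
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows  = array();
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = $xpath->query('.//a[@class="productname"]', $div)->item(0);
        $sku   = $xpath->query('.//span[@class="mainblack"]/strong', $div)->item(0);
        $price = $xpath->query('.//td[@class="productname"]/div', $div)->item(0);
        if (!$name || !$sku || !$price) {
            continue;                   // block does not have the expected shape
        }
        // Keep only the text inside the parentheses and the numeric part of the price
        if (!preg_match('/\(([^)]+)\)/', $sku->textContent, $s)) continue;
        if (!preg_match('/[\d.]+/', $price->textContent, $p)) continue;
        $cleanName = strtoupper(trim(preg_replace('/\s+/', ' ', $name->textContent)));
        $rows[] = array($cleanName, $s[1], $p[0]);
    }
    return $rows;
}

// Cut-down sample in the same shape as the DIV above (values are illustrative)
$sample = '<html><body><div class="resprodtop"><table><tbody><tr>'
        . '<td><a href="detailed_product.asp?id=691" class="productname"><b>Advantage</b> for Large Cats and Rabbits</a>'
        . '<span class="mainblack"><strong> - (BAADV19)</strong></span></td>'
        . '<td class="maintext">Price:</td>'
        . '<td class="productname"><div id="message0">&pound;14.46</div></td>'
        . '</tr></tbody></table></div></body></html>';

$rows = extractProducts($sample);
foreach ($rows as $r) {
    echo $r[0], ', -(', $r[1], '), ', $r[2], "\n";
}
```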

What I've tried, which does not work:

<?php

/**
*
* @get text between tags
*
* @param string $tag The tag name
*
* @param string $html The XML or XHTML string
*
* @param int $strict Whether to use strict mode
*
* @return array
*
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;

    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}


function getTags( $dom, $tagName, $attrName, $attrValue ){
    $html = '';
    $domxpath = new DOMXPath($dom);
    $newDom = new DOMDocument;
    $newDom->formatOutput = true;

   //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
    // $filtered =  $domxpath->query('//div[@class="className"]');
    // '//' when you don't know 'absolute' path
$filtered = $domxpath->query("*/div[@id='resprodtop']");

    // since above returns DomNodeList Object
    // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $node = $newDom->importNode( $myItem, true );    // import node
        $newDom->appendChild($node);                    // append node
    }
    $html = $newDom->saveHTML();
    return $html;
}


$some_link = 'http://www...';
$tagname = 'div';
$attrName = 'class';
$attrValue = 'resprodtop';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

//If using domxpath

//getTags( $dom, $tagName, $attrName, $attrValue );
//echo $html;

//If using gettextbetweentags


$string = file_get_contents('http://www...');

$content = getTextBetweenTags('div class="resprodtop"', $string, 1);

foreach( $content as $item )
{
    echo $item.'<br />';
}
///


?>

With the domxpath I get no output at all, I believe. But with the above code commented out I get a list of errors like:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
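The warnings above are what a strict XML parser reports on ordinary HTML: &nbsp; is not a predefined XML entity, an unescaped "&" in a link triggers "EntityRef: expecting ';'", and byte 0xA3 is a Latin-1 pound sign rather than valid UTF-8. A small sketch of the difference between loadXML() and loadHTML(), using a cut-down stand-in for the page rather than the real markup:

```php
<?php
libxml_use_internal_errors(true);   // collect parser errors instead of emitting warnings

// Stand-in markup: an HTML entity plus an unescaped "&" inside a link
$html = '<div class="resprodtop">'
      . '<a href="search_results.asp?search_type=free_any&start_record=0">PRODUCT&nbsp;NAME</a>'
      . '</div>';

// Strict XML rules: this input is not well-formed XML
$strict = new DOMDocument;
$strict->loadXML($html);
$xmlErrors = libxml_get_errors();
libxml_clear_errors();

// HTML rules: the same input is accepted
$lenient = new DOMDocument;
$lenient->loadHTML($html);
libxml_clear_errors();
$link = $lenient->getElementsByTagName('a')->item(0);

foreach ($xmlErrors as $e) {
    echo 'loadXML: ', trim($e->message), "\n";
}
echo 'loadHTML found link text: ', $link->textContent, "\n";
```

Calling getTextBetweenTags() with $strict set to 0 would take the loadHTML() branch instead.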

Posts posted by php_n00b

Extract data between two tags
