php_n00b

Extract data between two tags


I am parsing (screen-scraping, if you will) a website and need to extract the following:

 

<div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
            </tr></tbody></table></div>

 

There are multiple instances of this DIV on the page (all with the same class).

What I'd like is CSV output like this:

PRODUCT NAME, -(SKU), PRICE

 

 

What I've tried, which does not work:

<?php

/**
*
* @get text between tags
*
* @param string $tag The tag name
*
* @param string $html The XML or XHTML string
*
* @param int $strict Whether to use strict mode
*
* @return array
*
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;

    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}


function getTags( $dom, $tagName, $attrName, $attrValue ){
    $html = '';
    $domxpath = new DOMXPath($dom);
    $newDom = new DOMDocument;
    $newDom->formatOutput = true;

   //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
    // $filtered =  $domxpath->query('//div[@class="className"]');
    // '//' when you don't know 'absolute' path
$filtered = $domxpath->query("*/div[@id='resprodtop']");

    // since above returns DomNodeList Object
    // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $node = $newDom->importNode( $myItem, true );    // import node
        $newDom->appendChild($node);                    // append node
    }
    $html = $newDom->saveHTML();
    return $html;
}


$some_link = 'http://www...';
$tagname = 'div';
$attrName = 'class';
$attrValue = 'resprodtop';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

//If using domxpath

//getTags( $dom, $tagName, $attrName, $attrValue );
//echo $html;

//If using gettextbetweentags


$string = file_get_contents('http://www...');

$content = getTextBetweenTags('div class="resprodtop"', $string, 1);

foreach( $content as $item )
{
    echo $item.'<br />';
}
///


?>

 

With the DOMXPath version I get no output at all, I believe. With that code commented out (as above), I get a list of errors like:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

 

Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
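For what it's worth, the warnings above are what loadXML() reports when it is handed HTML rather than well-formed XML (undefined &nbsp;, bare ampersands, non-UTF-8 bytes). The XPath attempt also filters on @id although resprodtop is a class, and getElementsByTagName() expects a bare tag name such as 'div'. A minimal sketch of a DOM-based version, assuming every product DIV looks like the sample above; extractProducts() is a made-up name, and character-set handling for the real page is left out:

```php
<?php
// Hypothetical helper: returns one [name, sku, price] row per product DIV.
// Assumes the markup matches the sample DIV quoted in the question.
function extractProducts(string $html): array
{
    $dom = new DOMDocument();
    // loadHTML() is lenient about &nbsp; and unescaped '&'; collect the
    // parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $clean = function (string $text): string {
        // collapse runs of whitespace and trim the ends
        return trim(preg_replace('/\s+/u', ' ', $text));
    };

    $rows = [];
    // '//' searches the whole document; the filter is on @class, not @id
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = $xpath->query('.//a[@class="productname"]', $div)->item(0);
        $sku   = $xpath->query('.//span[@class="mainblack"]', $div)->item(0);
        $price = $xpath->query('.//td[@class="productname"]', $div)->item(0);
        if ($name === null || $sku === null || $price === null) {
            continue; // DIV does not have the expected shape
        }
        // the SKU sits inside parentheses, e.g. " - (SKU)"
        preg_match('/\(([^)]+)\)/', $sku->textContent, $m);
        $rows[] = [
            $clean($name->textContent),
            $m[1] ?? $clean($sku->textContent),
            $clean($price->textContent),
        ];
    }
    return $rows;
}
```

Each returned row can then be handed to fputcsv() as it is.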

 

 

 


Use cURL instead with regex, it's unbeatable! loadXML is a very crappy solution.

 

Dunno how to output it to CSV but as pure string it works perfectly:

 

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// fetch the page as a string
$products = curl_exec($ch);

// capture groups: 1 = product name, 2 = SKU, 3 = price
preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

// strip any leftover tags from every captured value
foreach($matches as &$matchgroup) {
	foreach($matchgroup as &$match)
		$match = strip_tags($match);
}
unset($matchgroup, $match); // drop the references left over from the loops

foreach($matches as $product)
	echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], '<br>';
?>

 


Is loadXML generally a rubbish method or just the way I was using it?

 

That works fine, except where there is a comma (,) in the actual DIV, how would you extract the comma? Otherwise the CSV file generally gets confused.

 

I haven't worked out how to do this yet, but it does save a CSV file properly now.
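On the comma question, one option is to let fputcsv() build the line instead of joining the values by hand: it wraps any field containing the delimiter in quotes, so the comma stays inside one column. A minimal sketch, assuming the three values are already extracted; csvLine() is a made-up helper name:

```php
<?php
// Hypothetical helper: formats one row as a CSV line by writing it to an
// in-memory stream with fputcsv(), which quotes fields where needed.
function csvLine(array $fields): string
{
    $fh = fopen('php://memory', 'r+');
    fputcsv($fh, $fields, ',', '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);
    return rtrim($line, "\n");
}

echo csvLine(['ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', 'BAADV08', '13.87']), "\n";
// prints: "ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS",BAADV08,13.87
```

A spreadsheet, or fgetcsv()/str_getcsv() with the same settings, reads the quoted name back as a single field.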

 

 

Now I have a text file of URLs I want. So I could just set that up as a loop setting CURLOPT_URL to the variable grabbed from the file each time?

If I do this, I have a feeling it would prompt me to save a file each iteration, so can I just spit strtoupper into an array and then echo it at the end to get the CSV file?

 

 


It won't prompt you to save the file. Just open the CSV in append mode when writing (fopen() with 'a', or the FILE_APPEND flag of file_put_contents()) and each batch is added to the end of the file.
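A minimal sketch of the append idea, assuming the rows for one URL are already in an array; appendRows() and the file name in the usage comment are made up for illustration:

```php
<?php
// Hypothetical helper: appends rows to a CSV file, creating it if needed.
function appendRows(string $path, array $rows): void
{
    $fh = fopen($path, 'a'); // 'a' = append, existing lines are kept
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for appending");
    }
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    fclose($fh);
}

// Usage inside the URL loop (illustrative):
// appendRows('products.csv', $rowsForThisUrl);
```

Calling it once per URL grows the same file, so nothing has to be collected in memory first.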

 

Yeah, loadXML, DOM and XPath classes are rather useless. Regex is precise and quick and very easy to maintain.

 

Where do you have a comma in your div? Paste an example link and I'll help you out!


Hi Silkfire.

 

ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87

                                      ^ comma was here

 

"Advantage Small Cats, Small Dogs & Pet Rabbits"

 

                                                               


How can I assign this to a variable?

		echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n";

 

If I can do that, then I should be able to get it to loop through the list of urls.
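One way to do that is to build the string with sprintf() instead of echoing the pieces. A minimal sketch, assuming $product has the same layout as in the loop above (index 1 = name, 2 = SKU, 3 = price); formatProduct() is a made-up name:

```php
<?php
// Hypothetical helper mirroring the echo statement: returns the line
// as a string instead of printing it.
function formatProduct(array $product): string
{
    return sprintf('%s, -(%s), £%s', strtoupper($product[1]), $product[2], $product[3]);
}

// Collect the lines in an array and output them in one go at the end.
$lines = [];
$lines[] = formatProduct(['', 'Advantage for Large Cats and Rabbits', 'BAADV19', '14.46']);
$csv = implode("\n", $lines) . "\n";
```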

 

 

Many thanks


Wrong approach: the cURL request has to loop through the URLs, not the regex. Hmm, I'm having trouble finding out why it isn't working with that comma. Could you sneak in another URL so I can improve the matching?


Are you getting the same results as me?

It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.

Other urls would just be:

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25

etc

------------------------------------------------

Looping through URL list

 

Example line from CSV

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

Text to grab:

<div id="totalprice_div" style="font-size: medium;">Total Price: 
		£0.26			</div>

 

<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>

 

Expected output (doesn't matter too much about order of columns)

 

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

 

 

My effort so far (the regex is wrong though, I don't really get regex):

<?php

// basic setup
$in  = 'urls.csv'; 
$out = 'myCSV.csv'; 
$fpo = fopen($out, 'w'); 
$fpi = fopen($in, 'r');
if (!$fpi) die("$in BROKE"); 
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $curlurl);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

//read input

while (!feof($fpi))
{	
$data = fgetcsv($fpi); 
$curlurl =	$data [1]
$products = curl_exec($ch);

preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);
    
foreach($matches as &$matchgroup) {
	foreach($matchgroup as &$match)
		$match = strip_tags($match);
}

foreach($matches as $product) {
        //$csv_data [1] = `echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"`;
	//$csv_data [2] =  $data [2]
	//fputcsv ($fpo,$csv_data)

	  }
//end while loop
}


?>
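The pattern in the effort above was written for the bestpet markup, so it cannot match the VioVet page. Note also that with the example input line the URL ends up in $data[0] after fgetcsv() and the SKU in $data[1], and that CURLOPT_URL has to be set inside the loop. A minimal sketch of the extraction step only, assuming the title and price always appear as in the "Text to grab" snippet; extractTitleAndPrice() is a made-up name:

```php
<?php
// Hypothetical helper: pulls the <title> text and the "Total Price" value
// out of one page. Returns null when either piece is missing.
function extractTitleAndPrice(string $html): ?array
{
    // everything between <title> and </title>
    if (!preg_match('#<title>(.*?)</title>#is', $html, $title)) {
        return null;
    }
    // the text after "Total Price:" inside the totalprice_div element,
    // without the surrounding whitespace
    $pattern = '#<div id="totalprice_div"[^>]*>\s*Total Price:\s*([^<]+?)\s*</div>#i';
    if (!preg_match($pattern, $html, $price)) {
        return null;
    }
    return ['title' => trim($title[1]), 'price' => $price[1]];
}

// Usage inside the read loop (illustrative, $data comes from fgetcsv()):
// $info = extractTitleAndPrice($page);
// if ($info !== null) {
//     fputcsv($fpo, [$data[0], trim($data[1]), $info['title'], $info['price']]);
// }
```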

 

 


You can choose another delimiter, use an unusual character like '|' (pipe character) for example. Is there such an option in the CSV function?

 

I'll get on as soon as I have time, mate.

 

/s


Hey do you want the resulting CSV to be:

 

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

 

or

 

ADVANTAGE FOR LARGE CATS AND RABBITS    -(BAADV19)    £14.46

 

?

 

I'm almost done with the script matey.


I want both please, - they're two different sites  :shy:

 

The bestpet one could be given a list of URLs and it extracts all the data from the page (multiple matches, multiple CSV lines per URL)

The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).


What do you mean? What do you want in the CSV? I'm having a hard time understanding what it is you want. I've excluded the pound sign from the price, by the way.

 

Please test this code. Make a text file called urls.txt with one URL on each line. The script can take some time to execute because it doesn't read the sites in parallel but one by one instead (making it asynchronous is more advanced).

 

<div style="font-family: Arial">
<?php
// read one URL per line, trimming line endings and skipping blank lines
$urls = array_filter(array_map('trim', file('urls.txt')));

echo '<pre>', print_r($urls, true), '</pre>';

// one cURL handle, reused for every URL
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$fh = fopen('products.csv', 'w');

foreach($urls as $url) {
	curl_setopt($ch, CURLOPT_URL, $url);

	$products = curl_exec($ch);
	if ($products === false)
		continue; // request failed, skip this URL

	// capture groups: 1 = product name, 2 = SKU, 3 = price
	preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

	//echo '<pre>', print_r($matches, true), '</pre>';

	foreach($matches as $product) {
		// drop the full match and strip leftover tags from name, SKU and price
		$row = array_map('strip_tags', array_slice($product, 1));
		echo strtoupper($row[0]), ', -(', $row[1], '), £', $row[2], '<br>';

		// semicolon as delimiter, so commas in product names do no harm
		fputcsv($fh, $row, ';', '"', '\\');
	}
}

fclose($fh);
?>
</div>
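The script above fetches one URL at a time. If that gets too slow, PHP's curl_multi_* functions can run the transfers concurrently. A rough sketch, assuming the URL list is small enough to request all at once and contains no duplicates; fetchAll() is a made-up name and error handling is left out:

```php
<?php
// Hypothetical helper: fetches all URLs concurrently and returns the
// response bodies keyed by URL.
function fetchAll(array $urls): array
{
    if ($urls === []) {
        return [];
    }

    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // drive all transfers until none of them is still running
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi, 1.0); // wait for activity, max 1 second
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    // on PHP 8+ the handles are freed when they go out of scope
    return $bodies;
}
```

Each returned body can then go through the same preg_match_all() and fputcsv() steps as in the sequential loop.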


The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:

 PRODUCT NAME, -(SKU), PRICE 

 

 

Viovet needs to output to CSV like this:

 

Output:

URL, SKU, PRODUCT NAME, PRICE
http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26

 

and read from a list of urls like this

http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html

 

I've just looked and some pages do have multiple matches on them e.g.

http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html


Why do you have viovet with so many periods? http://www.vio...v.e.t.co.uk

Do you want to run the same URLs and spit the results out for bestpet first, then viovet? Is that two different CSV files? Am I understanding this right?


Because I don't want this being found by the company in question (Google indexing etc.).

 

It's two different input/output files and two different regex matches I believe.

 

 


Are you afraid that viovet will find out you're indexing them? They'll see that when you load the page from their site...


@silkfire - not really. I'm only requesting one copy of each page, so it's hardly going to get noticed.
