Extract data between two tags

php_n00b · February 15, 2011

I am parsing (screen-scraping, if you will) a website and need to extract the following:

<div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
            </tr></tbody></table></div>

There are multiple instances of this DIV on the page (all with the same class name).

What I'd like is CSV output like this

PRODUCT NAME, -(SKU), PRICE

What I've tried, which does not work:

<?php

/**
*
* @get text between tags
*
* @param string $tag The tag name
*
* @param string $html The XML or XHTML string
*
* @param int $strict Whether to use strict mode
*
* @return array
*
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;

    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}


function getTags( $dom, $tagName, $attrName, $attrValue ){
    $html = '';
    $domxpath = new DOMXPath($dom);
    $newDom = new DOMDocument;
    $newDom->formatOutput = true;

   //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
    // $filtered =  $domxpath->query('//div[@class="className"]');
    // '//' when you don't know 'absolute' path
$filtered = $domxpath->query("*/div[@id='resprodtop']");

    // since above returns DomNodeList Object
    // I use following routine to convert it to string(html); copied it from someone's post in this site. Thank you.
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $node = $newDom->importNode( $myItem, true );    // import node
        $newDom->appendChild($node);                    // append node
    }
    $html = $newDom->saveHTML();
    return $html;
}


$some_link = 'http://www...';
$tagname = 'div';
$attrName = 'class';
$attrValue = 'resprodtop';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

//If using domxpath

//getTags( $dom, $tagName, $attrName, $attrValue );
//echo $html;

//If using gettextbetweentags


$string = file_get_contents('http://www...');

$content = getTextBetweenTags('div class="resprodtop"', $string, 1);

foreach( $content as $item )
{
    echo $item.'<br />';
}
///


?>

With the DOMXPath version I get no output at all, I believe. With that part commented out, as in the code above, I get a list of errors like:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Page is here: ht tp://w ww.best petpha rma cy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25
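Two things stand out in the code above: the XPath query in getTags() tests @id while the markup uses class="resprodtop", and the page is ordinary HTML rather than well-formed XML, which is what loadXML() is complaining about. Below is a minimal sketch of a DOMDocument/DOMXPath version, assuming every result row looks like the sample DIV. The function name extractProducts and the shortened sample string are made up for illustration, and the sample writes the price as &pound;19.15 on the assumption that the page source uses that entity.

```php
<?php
// Hypothetical helper: returns one [name, sku, price] row per <div class="resprodtop">.
function extractProducts(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid XML, so use loadHTML() and collect
    // parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    // '//' searches the whole document; the attribute to test is class, not id.
    foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
        $name  = $xpath->evaluate('string(.//a[@class="productname"])', $div);
        $sku   = $xpath->evaluate('string(.//span[@class="mainblack"])', $div);
        $price = $xpath->evaluate('string(.//td[@class="productname"])', $div);
        $rows[] = [trim($name), trim($sku), trim($price)];
    }
    return $rows;
}

// Shortened version of the sample DIV (assumption: the source uses &pound;).
$sample = '<div class="resprodtop"><table><tr>'
    . '<td><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a>'
    . '<span class="mainblack"><strong> - (SKU)</strong></span></td>'
    . '<td class="maintext">Price:</td>'
    . '<td class="productname"><div id="message0">&pound;19.15</div></td>'
    . '</tr></table></div>';

$rows = extractProducts($sample);
print_r($rows);
```

Each row can then be passed to fputcsv() to produce the CSV. If the real page differs from the sample, the three inner XPath expressions are the part that would need adjusting.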

silkfire · February 17, 2011

Use cURL instead with regex, it's unbeatable! loadXML is a very crappy solution.

Dunno how to output it to CSV but as pure string it works perfectly:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$products = curl_exec($ch);

// Capture groups: 1 = product name, 2 = SKU, 3 = price (the text after "pound;")
preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

// Strip any leftover markup from every captured group
foreach($matches as &$matchgroup) {
	foreach($matchgroup as &$match)
		$match = strip_tags($match);
}
unset($matchgroup, $match); // drop the references once the loop is done

foreach($matches as $product)
	echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], '<br>';
?>

php_n00b · February 21, 2011

Is loadXML generally a rubbish method or just the way I was using it?

That works fine, except where there is a comma (,) in the actual DIV. How would you strip out the comma? Otherwise the CSV file gets confused.

I haven't worked out how to do this yet, but it does save a CSV file properly now.

Now I have a text file of URLs I want. So I could just set that up as a loop setting CURLOPT_URL to the variable grabbed from the file each time?

If I do this, I have a feeling it would prompt me to save a file each iteration, so can I just spit strtoupper into an array and then echo it at the end to get the CSV file?

silkfire · February 21, 2011

It won't prompt to save the file. Just use the APPEND flag when writing and it will add it to the end of the CSV file.

Yeah, loadXML, DOM and XPath classes are rather useless. Regex is precise and quick and very easy to maintain.

Where do you have comma in your div? Paste example link and I'll help you out!
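The "append flag" mentioned above is the 'a' mode of fopen() (or FILE_APPEND with file_put_contents()). Here is a small sketch, assuming each page's rows are written as soon as they are parsed. The temporary file and the sample rows are invented for illustration.

```php
<?php
$csvFile = tempnam(sys_get_temp_dir(), 'csv'); // stand-in for the real CSV path

// Hypothetical rows, as if parsed from two different pages
$pages = [
    [['PRODUCT A', '-(SKU1)', '1.00']],
    [['PRODUCT B', '-(SKU2)', '2.00'], ['PRODUCT C', '-(SKU3)', '3.00']],
];

foreach ($pages as $rows) {
    $fh = fopen($csvFile, 'a'); // 'a' = append: lines already in the file are kept
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    fclose($fh);
}

// Count the lines that ended up in the file, then clean up
$lineCount = count(file($csvFile, FILE_IGNORE_NEW_LINES));
unlink($csvFile);
echo $lineCount, " lines written\n";
```

Opening the file once in 'w' mode before the loop and closing it afterwards works just as well; append mode mainly matters when the script is run several times against the same output file.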

php_n00b · February 22, 2011

Hi Silkfire.

ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87

^ comma was here

"Advantage Small Cats, Small Dogs & Pet Rabbits"

php_n00b · February 22, 2011

How can I assign this to a variable?

		echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n";

If I can do that, then I should be able to get it to loop through the list of urls.

Many thanks
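echo with commas only prints its arguments, so there is nothing to assign. Joining the same pieces with the . operator gives a string that can be stored. A quick sketch, with a made-up $product array standing in for one regex match:

```php
<?php
// Stand-in for one entry of $matches: [full match, name, SKU, price]
$product = ['', 'Advantage for Large Cats and Rabbits', 'BAADV19', '14.46'];

// Same pieces as the echo, joined with "." instead of ","
$line = strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3] . "\n";

// Collect the lines in an array and print (or write) them once at the end
$lines = [];
$lines[] = $line;
echo implode('', $lines);
```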

silkfire · February 22, 2011

Wrong approach, the cURL has to loop through the URLs, not the regex. Hmmm, I'm having trouble finding out why it ain't working with that comma. Could you sneak in another URL so I could improve the matching?

php_n00b · February 22, 2011

Are you getting the same results as me?

It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.

Other urls would just be:

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25

etc

------------------------------------------------

Looping through URL list

Example line from CSV

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

Text to grab:

<div id="totalprice_div" style="font-size: medium;">Total Price: 
		£0.26			</div>

<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>

Expected output (doesn't matter too much about order of columns)

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

My effort so far (the regex is wrong though; I don't really get regex):

<?php

// basic setup
$in  = 'urls.csv'; 
$out = 'myCSV.csv'; 
$fpo = fopen($out, 'w'); 
$fpi = fopen($in, 'r');
if (!$fpi) die("$in BROKE"); 
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $curlurl);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

//read input

while (!feof($fpi))
{	
$data = fgetcsv($fpi); 
$curlurl =	$data [1]
$products = curl_exec($ch);

preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);
    
foreach($matches as &$matchgroup) {
	foreach($matchgroup as &$match)
		$match = strip_tags($match);
}

foreach($matches as $product) {
        //$csv_data [1] = `echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], "\n"`;
	//$csv_data [2] =  $data [2]
	//fputcsv ($fpo,$csv_data)

	  }
//end while loop
}


?>
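For the "text to grab" shown above, one possible approach is a pair of small regexes, one for the title and one for the price DIV. This is only a sketch, assuming every product page contains a title element and a div with id="totalprice_div" laid out like the excerpt. The function name extractTitleAndPrice is hypothetical, and the URL and SKU are taken from the example CSV line (the URL is the first column, so index 0 of the fgetcsv() row).

```php
<?php
// Hypothetical helper: pulls the page title and the total price out of one product page.
// Returns null when either piece is missing.
function extractTitleAndPrice(string $html): ?array
{
    // <title>...</title>; the s flag lets "." span line breaks
    if (!preg_match('#<title>(.*?)</title>#is', $html, $t)) {
        return null;
    }
    // Whatever follows the "Total Price:" label inside the totalprice_div
    if (!preg_match('#<div id="totalprice_div"[^>]*>\s*Total Price:\s*(.*?)\s*</div>#is', $html, $p)) {
        return null;
    }
    return [trim($t[1]), trim($p[1])];
}

// The excerpt from the post, including its line break and tabs
$page = '<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>' . "\n"
      . '<div id="totalprice_div" style="font-size: medium;">Total Price: ' . "\n"
      . "\t\t£0.26\t\t\t</div>";

// Values from the example line of the input CSV
$url = 'http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html';
$sku = 'p4211';

[$title, $price] = extractTitleAndPrice($page);
$row = [$url, $sku, $title, $price];
print_r($row);
```

In the loop, $page would be the string returned by curl_exec() and $row would go straight into fputcsv(). Category pages with several products would need a different pattern, since this one only looks for a single total price.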

silkfire · February 22, 2011

You can choose another delimiter, use an unusual character like '|' (pipe character) for example. Is there such an option in the CSV function?

I'll get on as soon as I have time, mate.

/s
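On the delimiter question: fputcsv() takes the delimiter as its third argument, and it also wraps any field containing the delimiter in double quotes, so a comma inside a product name survives either way. A short sketch follows; the helper name csvLine is invented, and php://memory is used only so the result can be inspected without creating a file.

```php
<?php
$row = ['ADVANTAGE SMALL CATS, SMALL DOGS & PET RABBITS', '-(BAADV08)', '£13.87'];

// Hypothetical helper: writes one row to an in-memory stream and returns the CSV line
function csvLine(array $row, string $delimiter): string
{
    $fh = fopen('php://memory', 'w+');
    fputcsv($fh, $row, $delimiter, '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);
    return $line;
}

$piped  = csvLine($row, '|'); // pipe-delimited
$commas = csvLine($row, ','); // comma-delimited; the product name is wrapped in double quotes

echo $piped, $commas;
```

Reading the file back with fgetcsv() or str_getcsv() and the same delimiter returns the original three fields.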

silkfire · February 23, 2011

Hey do you want the resulting CSV to be:

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

or

ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

?

I'm almost done with the script matey.

php_n00b · February 23, 2011

I want both please, - they're two different sites :shy:

The bestpet one could be given a list of URLs and it extracts all the data from the page (multiple matches, multiple CSV lines per URL)

The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).

silkfire · February 23, 2011

What do you mean? What do you want in the CSV? I'm having a hard time understanding what it is you want. I've excluded the pound sign from the price tag, by the way.

Please test this code. Make a text file called urls.txt with one URL on each line. The script can take some time to execute because it doesn't read the sites in parallel but one by one (doing it asynchronously is more advanced).

<div style="font-family: Arial">
<?php
// One URL per line; drop line endings and skip blank lines
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

echo '<pre>', print_r($urls, true), '</pre>';

$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$fh = fopen('products.csv', 'w');

foreach($urls as $url) {
	// trim() removes a stray "\r" left over from Windows line endings
	curl_setopt($ch, CURLOPT_URL, trim($url));

	$products = curl_exec($ch);
	if($products === false)
		continue; // skip pages that could not be fetched

	preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

	foreach($matches as &$matchgroup) {
		foreach($matchgroup as &$match)
			$match = strip_tags($match);
	}
	unset($matchgroup, $match); // drop the references before $matches is looped again

	//echo '<pre>', print_r($matches, true), '</pre>';

	foreach($matches as $product) {
		// Drop element 0 (the full match), keeping name, SKU and price
		$product = array_slice($product, 1);
		echo strtoupper($product[0]), ', -(', $product[1], '), £', $product[2], '<br>';

		// Semicolon-delimited so commas in product names don't split columns
		fputcsv($fh, $product, ';');
	}
}

fclose($fh);
?>
</div>

php_n00b · February 25, 2011

The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:

PRODUCT NAME, -(SKU), PRICE

Viovet needs to output like this to csv,

Output:

URL, SKU, PRODUCT NAME, PRICE
http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at Vi.o.Vet (V.i.o.V.e.t.co.uk), £0.26

and read from a list of urls like this

http://www.vio...v.e.t.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
http://www.vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html

I've just looked and some pages do have multiple matches on them e.g.

http://www..vio...v.e.t.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html

php_n00b · February 25, 2011

Thanks - that works fine for be.st.pet but just need to get a solution for viovet now.

silkfire · February 25, 2011

Why do you have viovet with so many periods? http://www.vio...v.e.t.co.uk

Do you want to run the same URLs and spit the results out for bestpet first, then viovet? Is that two different CSVs? Am I understanding this right?

php_n00b · February 25, 2011

Because I don't want this being found by the company in question (Google indexing etc.)

It's two different input/output files and two different regex matches I believe.

silkfire · February 25, 2011

Are you afraid that viovet will find out you're indexing them? They'll see that when you load the page from their site...

php_n00b · February 28, 2011

@silkfire - not really I'm only requesting one copy of each page, so it's hardly going to get noticed.

silkfire · February 28, 2011

I hope it all worked out.

php_n00b · February 28, 2011

Cheers Silkfire - did you get the viovet regex worked out for me?
