
I am parsing (screen-scraping, if you will) a website and need to extract the following:

 

<div class="resprodtop"><table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr valign="middle"><td width="59%"><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a><span class="mainblack"><strong> - (SKU)</strong></span></td><td width="7%" class="maintext">Price:</td><td width="23%" class="productname"><div id="message0">£19.15</div></td><td width="11%" align="right" valign="middle"><a href="detailed_product.asp?id=691" class="mainblack"><strong> Details >></strong></a></td>
            </tr></tbody></table></div>

 

There are multiple instances of a DIV like this on the page (they all share the same class attribute).

What I'd like is CSV output like this:

PRODUCT NAME, -(SKU), PRICE

 

 

What I've tried and does not work:

<?php

/**
*
* @get text between tags
*
* @param string $tag The tag name
*
* @param string $html The XML or XHTML string
*
* @param int $strict Whether to use strict mode
*
* @return array
*
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new DOM object ***/
    $dom = new DOMDocument;

    /*** discard white space (must be set before loading) ***/
    $dom->preserveWhiteSpace = false;

    /*** load the html into the object ***/
    if ($strict == 1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagName($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}


function getTags( $dom, $tagName, $attrName, $attrValue ){
    $html = '';
    $domxpath = new DOMXPath($dom);
    $newDom = new DOMDocument;
    $newDom->formatOutput = true;

    //$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
    //$filtered = $domxpath->query('//div[@class="className"]');
    // '//' when you don't know the 'absolute' path
    $filtered = $domxpath->query("*/div[@id='resprodtop']");

    // since the above returns a DOMNodeList object, I use the following
    // routine to convert it to an HTML string; copied from someone's post on this site. Thank you.
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $node = $newDom->importNode( $myItem, true );   // import node
        $newDom->appendChild($node);                    // append node
    }
    $html = $newDom->saveHTML();
    return $html;
}


$some_link = 'http://www...';
$tagName = 'div';
$attrName = 'class';
$attrValue = 'resprodtop';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

// If using DOMXPath:

//$html = getTags($dom, $tagName, $attrName, $attrValue);
//echo $html;

// If using getTextBetweenTags:

$string = file_get_contents('http://www...');

$content = getTextBetweenTags('div class="resprodtop"', $string, 1);

foreach( $content as $item )
{
    echo $item.'<br />';
}
///


?>

 

With the DOMXPath version I get no output at all, I believe. But with that part commented out (using getTextBetweenTags instead) I get a list of errors like:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x22 0x2B 0x70 in Entity, line: 46 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity, line: 164 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: EntityRef: expecting ';' in Entity, line: 175 in /home/.../htdocs/test3.php on line 24
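Those warnings are what happens when loadXML() meets real-world HTML (raw £ bytes, &nbsp;, unescaped & in URLs). As a hedged sketch of the DOM route, built only from the snippet quoted at the top of the thread: loadHTML() tolerates those bytes, and the XPath query has to match the class attribute (the posted query looks for @id='resprodtop', but the markup uses class):

```php
<?php
// Sketch, not the poster's code: loadHTML() copes with the bytes and
// entities that make loadXML() fail, and the query matches the class
// attribute. $html stands in for the fetched page; here it is a trimmed
// copy of the snippet from the post.
$html = '<div class="resprodtop"><table><tbody><tr>'
      . '<td><a href="detailed_product.asp?id=691" class="productname"><b>PRODUCT</b>NAME</a>'
      . '<span class="mainblack"><strong> - (SKU)</strong></span></td>'
      . '<td class="maintext">Price:</td>'
      . '<td class="productname"><div id="message0">&pound;19.15</div></td>'
      . '</tr></tbody></table></div>';

libxml_use_internal_errors(true);   // swallow the parser warnings
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows  = array();
foreach ($xpath->query('//div[@class="resprodtop"]') as $div) {
    // second argument to query() makes the path relative to this div
    $name  = $xpath->query('.//a[@class="productname"]', $div)->item(0)->nodeValue;
    $sku   = $xpath->query('.//span[@class="mainblack"]', $div)->item(0)->nodeValue;
    $price = $xpath->query('.//div[starts-with(@id, "message")]', $div)->item(0)->nodeValue;
    $rows[] = array(trim($name), trim($sku), trim($price));
}

foreach ($rows as $row) {
    echo implode(', ', $row), "\n";
}
```

With the real page the same loop runs once per matching div, so each product becomes one row.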

 

Page is here: http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25

 

 

 

https://forums.phpfreaks.com/topic/227763-extract-data-between-two-tags/

Use cURL instead with regex, it's unbeatable! loadXML is a very crappy solution.

 

Dunno how to output it to CSV but as pure string it works perfectly:

 

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=0&search=prescription&sort=&sec=&snum=25');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$products = curl_exec($ch);

preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

foreach($matches as &$matchgroup) {
	foreach($matchgroup as &$match)
		$match = strip_tags($match);
}

foreach($matches as $product)
	echo strtoupper($product[1]), ', -(', $product[2], '), £', $product[3], '<br>';
?>

 

Is loadXML generally a rubbish method or just the way I was using it?

 

That works fine, except where there is a comma (,) in the actual DIV text - how would you handle the comma? Otherwise the CSV file generally gets confused.

 

I haven't worked out how to do this yet, but it does save a CSV file properly now.

 

 

Now I have a text file of the URLs I want. So I could just set that up as a loop, setting CURLOPT_URL to the variable grabbed from the file each time?

If I do this, I have a feeling it would prompt me to save a file on each iteration, so can I just push the strtoupper output into an array and then echo it at the end to get the CSV file?

 

 

It won't prompt to save the file. Just use the APPEND flag when writing and it will add it to the end of the CSV file.
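A minimal sketch of that append approach (the filename is just a placeholder):

```php
<?php
// Sketch: append rows instead of rewriting the file each iteration.
// 'a' mode opens for append (creating the file if missing), so rows
// written for earlier URLs are kept. Filename is a placeholder.
$row = array('PRODUCT NAME', '-(SKU)', '14.46');
$fh  = fopen('products.csv', 'a');
fputcsv($fh, $row);
fclose($fh);
```

file_put_contents('products.csv', $line, FILE_APPEND) is the one-liner equivalent if you already have the line as a string.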

 

Yeah, loadXML, DOM and XPath classes are rather useless. Regex is precise and quick and very easy to maintain.

 

Where do you have comma in your div? Paste example link and I'll help you out!

Hi Silkfire.

 

ADVANTAGE FOR LARGE CATS AND RABBITS -(BAADV19) £14.46

ADVANTAGE SMALL CATS SMALL DOGS & PET RABBITS -(BAADV08) £13.87

                                      ^ comma was here

 

"Advantage Small Cats, Small Dogs & Pet Rabbits"

 

                                                               

Are you getting the same results as me?

It does work, but as it's a CSV it makes an extra column and splits the columns at that comma.
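One way to handle that comma, as a sketch: write rows with fputcsv() instead of echoing comma-joined strings, since it quotes any field that contains the delimiter:

```php
<?php
// Sketch: fputcsv() encloses a field containing the delimiter in double
// quotes, so the comma in the product name stays in one column.
// php://temp is used here just to capture the output as a string.
$fh = fopen('php://temp', 'r+');
fputcsv($fh, array('Advantage Small Cats, Small Dogs & Pet Rabbits', '-(BAADV08)', '13.87'));
rewind($fh);
$line = stream_get_contents($fh);
fclose($fh);

echo $line;
// "Advantage Small Cats, Small Dogs & Pet Rabbits",-(BAADV08),13.87
```

Any spreadsheet or CSV parser that follows the usual quoting rules will read that back as three columns.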

Other URLs would just be:

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=25&search=prescription&sort=&sec=&snum=25

http://www.bestpetpharmacy.co.uk/search_results.asp?search_type=free_any&start_record=50&search=prescription&sort=&sec=&snum=25

etc

------------------------------------------------

Looping through URL list

 

Example line from CSV

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211

Text to grab:

<div id="totalprice_div" style="font-size: medium;">Total Price: 
		£0.26			</div>

 

<title>ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk)</title>

 

Expected output (doesn't matter too much about order of columns)

 

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26
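As a sketch of grabbing those two values with DOM (the inline $page here stands in for the fetched viovet HTML, built from the fragments quoted above):

```php
<?php
// Sketch, not the thread's final script: read the <title> and the
// totalprice_div text with DOM + XPath. $page is a stand-in for the
// cURL response, assembled from the fragments quoted in the post.
$page = '<html><head><title>ACP 10mg Tablets &raquo; Sold individually at VioVet (VioVet.co.uk)</title></head>'
      . '<body><div id="totalprice_div" style="font-size: medium;">Total Price: &pound;0.26</div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$title = trim($dom->getElementsByTagName('title')->item(0)->nodeValue);
$div   = $xpath->query('//div[@id="totalprice_div"]')->item(0);
$price = trim(str_replace('Total Price:', '', $div->nodeValue));

echo "$title, $price\n";
```

The parser decodes &raquo; and &pound; for you, so $title and $price come out as plain text.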

 

 

My effort so far (the regex is wrong though - I don't really get regex):

<?php

// basic setup
$in  = 'urls.csv';
$out = 'myCSV.csv';
$fpo = fopen($out, 'w');
$fpi = fopen($in, 'r');
if (!$fpi) die("$in BROKE");

$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// read input
while (!feof($fpi))
{
    $data = fgetcsv($fpi);
    $curlurl = $data[1];
    curl_setopt($ch, CURLOPT_URL, $curlurl);   // set the URL each iteration
    $products = curl_exec($ch);

    preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

    foreach($matches as &$matchgroup) {
        foreach($matchgroup as &$match)
            $match = strip_tags($match);
    }

    foreach($matches as $product) {
        //$csv_data[1] = strtoupper($product[1]) . ', -(' . $product[2] . '), £' . $product[3];
        //$csv_data[2] = $data[2];
        //fputcsv($fpo, $csv_data);
    }
// end while loop
}


?>

 

 

Hey do you want the resulting CSV to be:

 

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

 

or

 

ADVANTAGE FOR LARGE CATS AND RABBITS    -(BAADV19)    £14.46

 

?

 

I'm almost done with the script matey.

I want both please - they're two different sites  :shy:

 

The bestpet one could be given a list of URLs, and it extracts all the data from each page (multiple matches, so multiple CSV lines per URL).

The viovet one is fed a list of URLs but only needs to extract a single item per page (one CSV line per URL).

What do you mean? What do you want in the CSV? I'm having a hard time understanding what it is you want. I've excluded the pound sign from the price tag, by the way.

 

Please test this code. Make a text file called urls.txt with one URL on each line. The script can take some time to execute because it doesn't read the sites in parallel but one by one (it's more advanced if you want it asynchronous).

 

<div style="font-family: Arial">
<?php
$urls = file_get_contents('urls.txt');
$urls = explode("\n", $urls);

echo '<pre>', print_r($urls, true), '</pre>';

$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$fh = fopen('products.csv', 'w');

foreach($urls as $url) {
	curl_setopt($ch, CURLOPT_URL, $url);

	$products = curl_exec($ch);

	preg_match_all('#jpg" alt="(?:<b>)?([^"]+)".*?strong>.*?\(([^)]+)\).*?pound;([^<]+)</td#i', $products, $matches, PREG_SET_ORDER);

	foreach($matches as &$matchgroup) {
		foreach($matchgroup as &$match)
			$match = strip_tags($match);
	}

	//echo '<pre>', print_r($matches, true), '</pre>';

	foreach($matches as &$product) {
		$product = array_slice($product, 1);
		echo strtoupper($product[0]), ', -(', $product[1], '), £', $product[2], '<br>';

		fputcsv($fh, $product, ';');
	}
}

fclose($fh);
?>
</div>

The bestpet one works just fine; it just needs to read its input from a CSV file of URLs. The output is fine:

 PRODUCT NAME, -(SKU), PRICE 

 

 

The viovet one needs to output to CSV like this:

 

Output:

URL, SKU, PRODUCT NAME, PRICE
http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html, p4211, ACP 10mg Tablets » Sold individually at VioVet (VioVet.co.uk), £0.26

 

and read from a list of URLs like this:

http://www.viovet.co.uk/p4211/ACP_10mg_Tablets_-_Sold_individually/product_info.html
http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantix/c1_31_375/category.html
http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advocate_Spot-on_Solution/c1_31_294/category.html
http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Alamycin/c1_31_426/category.html

 

I've just looked, and some pages do have multiple matches on them, e.g.

http://www.viovet.co.uk/Prescription_Drugs-Prescription_Drugs_A_-_C-Advantage/c1_31_39/category.html
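A hedged sketch of that viovet loop (one CSV row per URL: URL, SKU, PRODUCT NAME, PRICE). The title and price patterns are assumptions based on the fragments quoted earlier in the thread, and the parsing is split into a function so it can be checked without network access; category pages with several products would still need the multi-match treatment.

```php
<?php
// Sketch only. viovet_row() turns one page into one CSV row; the SKU is
// taken from the pNNNN segment of the URL, the title and price from the
// fragments quoted earlier (both are assumptions about the page layout).
function viovet_row($url, $page) {
    preg_match('#/(p\d+)/#', $url, $m);                 // e.g. /p4211/
    $sku = isset($m[1]) ? $m[1] : '';

    preg_match('#<title>([^<]+)</title>#i', $page, $m);
    $title = isset($m[1]) ? html_entity_decode(trim($m[1]), ENT_QUOTES, 'UTF-8') : '';

    preg_match('#totalprice_div[^>]*>\s*Total Price:\s*([^<]+)<#i', $page, $m);
    $price = isset($m[1]) ? html_entity_decode(trim($m[1]), ENT_QUOTES, 'UTF-8') : '';

    return array($url, $sku, $title, $price);
}

// The fetch loop (urls.txt: one URL per line), only run if the file exists:
if (is_readable('urls.txt')) {
    $fh = fopen('viovet.csv', 'w');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    foreach (file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $url) {
        curl_setopt($ch, CURLOPT_URL, trim($url));
        $page = curl_exec($ch);
        if ($page !== false) {
            fputcsv($fh, viovet_row($url, $page));   // fputcsv quotes commas for us
        }
    }
    curl_close($ch);
    fclose($fh);
}
```

Because fputcsv() does the quoting, a comma inside a product title can't split the row into extra columns.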
