Hi guys,

 

I've got a data-scraping script that is not picking up a few of the HTML elements/fields that I need.

 

I can't get the following elements:

 

product title

price and

specification

 

It could be that the XPath is not correct, since the website has lots of divs/tables without any id or class. Please look at the XPath and the HTML source of the page and suggest corrections.

 

cheers

 

 

Here is my code for the above:

				
<?php

// go through each product
$products_nodes = $listing_xpath->query('//div/a[contains(@href, "item-details")]');

// get product details
$prod_title_node = $xpath->query('//descendant::h1');
$prod_price_node = $xpath->query('//div[@class="details-subprice"]');
$prod_spec_nodes = $xpath->query('//html/body/table/tbody/tr/td/table[2]/tbody/tr/td[3]/div[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table[2]/*');

/* save specification records */
$group = '';
foreach ($prod_spec_nodes as $node) {
    if (preg_match('!h[1234]!i', $node->tagName)) {
        $group = $node->nodeValue;
    }

    $left  = $xpath->query('//td[@class="techspec"]', $node)->item(0);
    $right = $xpath->query('//td[@class="techvalue"]', $node)->item(0);
    if ($left && $right) {
        add_specs(array('productid' => $prodid, 'group' => $group, 'left' => $left->nodeValue, 'right' => $right->nodeValue));
    }
}
?>

 

And here is the extracted specification part of the source of the page I need information from; the full source of the page is attached.

 

<?php

<tbody><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            General</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Enclosure Type</td><td class="techvalue" width="100%" align="right" valign="top">External
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Device Type</td><td class="techvalue" width="100%" align="right" valign="top">Router
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Width</td><td class="techvalue" width="100%" align="right" valign="top">30 cm
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Depth</td><td class="techvalue" width="100%" align="right" valign="top">19 cm
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Height</td><td class="techvalue" width="100%" align="right" valign="top">4.3 cm
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Weight</td><td class="techvalue" width="100%" align="right" valign="top">1.9 kg
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Expansion / Connectivity</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Interfaces</td><td class="techvalue" width="100%" align="right" valign="top">5 x network node - Ethernet 10Base-T/100Base-TX - RJ-45 ¦  2 x network - Ethernet 10Base-T/100Base-TX - RJ-45 ( WAN ) ¦  1 x management 
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Manufacturer Warranty</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Service & Support Details</td><td class="techvalue" width="100%" align="right" valign="top">2 years warranty
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Service & Support</td><td class="techvalue" width="100%" align="right" valign="top">Limited warranty
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Networking</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Features</td><td class="techvalue" width="100%" align="right" valign="top">Firewall protection, NAT support, VPN, IGMP snooping, manageable, IP address filtering
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Switching Protocol</td><td class="techvalue" width="100%" align="right" valign="top">Ethernet
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Connectivity Technology</td><td class="techvalue" width="100%" align="right" valign="top">Wired
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Data Link Protocol</td><td class="techvalue" width="100%" align="right" valign="top">Ethernet, Fast Ethernet
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Ports Qty</td><td class="techvalue" width="100%" align="right" valign="top">5-port switch
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Remote Management Protocol</td><td class="techvalue" width="100%" align="right" valign="top">SNMP 3, HTTP
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Routing Protocol</td><td class="techvalue" width="100%" align="right" valign="top">OSPF, RIP-1, RIP-2, IGMPv2, DVMRP, OSPFv2, PIM-SM, PIM-DM
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Network / Transport Protocol</td><td class="techvalue" width="100%" align="right" valign="top">AppleTalk, DECnet, UDP/IP, L2TP, VoIP, IP/IPX, IPSec
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Miscellaneous</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Compliant Standards</td><td class="techvalue" width="100%" align="right" valign="top">IEEE 802.1x
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Authentication Method</td><td class="techvalue" width="100%" align="right" valign="top">Secure Shell (SSH), RADIUS, PAP, CHAP, TACACS
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Encryption Algorithm</td><td class="techvalue" width="100%" align="right" valign="top">DES, Triple DES, SHA, AES, SSL
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Power</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Power Device</td><td class="techvalue" width="100%" align="right" valign="top">Power supply
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top">
             
            Environmental Parameters</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Min Operating Temperature</td><td class="techvalue" width="100%" align="right" valign="top">0 °C
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Max Operating Temperature</td><td class="techvalue" width="100%" align="right" valign="top">40 °C
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top">
               
              Humidity Range Operating</td><td class="techvalue" width="100%" align="right" valign="top">5 - 80%
               
            </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr></tbody>
?>

 

 

[attachment deleted by admin]

https://forums.phpfreaks.com/topic/187016-please-help-with-curl-xpath-php/

I still cannot sort this out. I'm splitting the specification table in two: one part is the tech spec type (name) and the other is the actual value.

 

<?php

$left  = $xpath->query('//td[@class="techspec"]', $node)->item(0);
$right = $xpath->query('//td[@class="techvalue"]', $node)->item(0);
?>

 

I think the problem is in the main part, where I can't seem to find the correct XPath for the specification table, since it has no name and is wrapped in one table inside another, and so on. Any ideas?

 

 

Your attached HTML file and the sample HTML do not correlate and your PHP code doesn't appear to match the structure of either of them.

 

That said, I took the source code from this page because that looks to be more useful. The following snippet is then able to extract the techspec/value pairs from the HTML table (your HTML/PHP might be different so this may not be simply copy/paste).

 

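// Get product specification table rows: only rows whose second cell has a
// class starting with "techspec", taken from the table that holds the group headers
$prod_spec_nodes = $xpath->query('//tr[@class="techspec b"]/parent::table/tr[td[2][starts-with(@class, "techspec")]]');

// Note: $group is never reassigned in this snippet, so it stays empty
$group = '';
foreach ($prod_spec_nodes as $node) {
    // The leading "." makes each query relative to the current row
    $spec  = $xpath->query('.//td[@class="techspec"]', $node);
    $value = $xpath->query('.//td[@class="techvalue"]', $node);

    if ($spec->length === 1 && $value->length === 1) {
        $spec  = trim($spec->item(0)->nodeValue);
        $value = trim($value->item(0)->nodeValue);
        // 12345 is a placeholder product id
        add_specs(array('productid' => 12345, 'group' => $group, 'left' => $spec, 'right' => $value));
    }
}

// Stand-in for the real add_specs(): just dump what would be saved
function add_specs($values) {
    var_dump($values);
}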
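
For what it's worth, here is a rough self-contained sketch of the same idea that also fills in the group name. It rests on some assumptions: the sample markup is a trimmed-down, hypothetical copy of the rows quoted earlier in the thread, `extract_specs()` is just an illustrative helper name, and the array keys simply reuse the ones passed to `add_specs()`. The row query starts from `//tr`, so it should match whether or not the rows sit inside a `tbody`, and the cell queries start with `./` so they only look inside the current row (a path starting with `//` searches from the document root even when a context node is passed).

```php
<?php
// Hypothetical sample markup, modelled on the table rows quoted in the thread.
$html = <<<'HTML'
<table>
<tr class="techspec b"><td height="17"></td><td colspan="2">General</td></tr>
<tr><td height="17"></td><td class="techspec">Device Type</td><td class="techvalue">Router</td></tr>
<tr><td height="17"></td><td class="techspec">Width</td><td class="techvalue">30 cm</td></tr>
<tr class="techspec b"><td height="17"></td><td colspan="2">Power</td></tr>
<tr><td height="17"></td><td class="techspec">Power Device</td><td class="techvalue">Power supply</td></tr>
</table>
HTML;

// Illustrative helper (not from the original code): returns one array per spec row.
function extract_specs($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Select group header rows and spec rows together, in document order
    $rows = $xpath->query('//tr[@class="techspec b" or td[@class="techspec"]]');

    $specs = array();
    $group = '';
    foreach ($rows as $row) {
        // A header row starts a new group
        if ($row->getAttribute('class') === 'techspec b') {
            $group = trim($row->textContent);
            continue;
        }

        // "./td" is relative to the current row only
        $spec  = $xpath->query('./td[@class="techspec"]', $row);
        $value = $xpath->query('./td[@class="techvalue"]', $row);
        if ($spec->length === 1 && $value->length === 1) {
            $specs[] = array(
                'group' => $group,
                'left'  => trim($spec->item(0)->nodeValue),
                'right' => trim($value->item(0)->nodeValue),
            );
        }
    }
    return $specs;
}

print_r(extract_specs($html));
```

Each returned entry could then be handed to `add_specs()` together with the product id. Against the real page the class names and nesting may differ, so treat the two XPath expressions as a starting point and adjust them to the HTML you actually download.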

 

[ot]

Part of the "no-one helping" could well have been due to the mismatching source, or due to the nature of the question (perhaps people run away when they see "XPath") or, as is increasingly the case for me, they view the thread once and intend to come back and have a proper look at the problem but the thread disappears from the list when they next view the forum.

[/ot]
