jk2010 Posted January 3, 2010

Hi guys,

I've got a data scraping script that is not picking up a few HTML elements/fields that I need. I can't get the following elements: product title, price and specification.

It could be that the XPath is not correct, since the website has lots of divs/tables without any id or class. Please look at the XPath and the source code of the HTML page and suggest a correction.

Cheers

Here is my code for the above:

<?php
// go through each product
$products_nodes = $listing_xpath->query('//div/a[contains(@href, "item-details")]');

// get product details
$prod_title_node = $xpath->query('//descendant::h1');
$prod_price_node = $xpath->query('//div[@class="details-subprice"]');
$prod_spec_nodes = $xpath->query('//html/body/table/tbody/tr/td/table[2]/tbody/tr/td[3]/div[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table[2]/*');

/* save specification records */
$group = '';
foreach($prod_spec_nodes as $node) {
    if(preg_match('!h[1234]!i', $node->tagName)) {
        $group = $node->nodeValue;
    }
    $left = $xpath->query('//td[@class="techspec"]', $node)->item(0);
    $right = $xpath->query('//td[@class="techvalue"]', $node)->item(0);
    if ($left && $right ) {
        add_specs(array('productid' => $prodid, 'group' => $group, 'left' => $left->nodeValue, 'right' => $right->nodeValue) );
    }
}
?>

And here is the source code of the specification part of the page I need information from. The full code of the page is attached.
<?php <tbody><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> General</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Enclosure Type</td><td class="techvalue" width="100%" align="right" valign="top">External </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Device Type</td><td class="techvalue" width="100%" align="right" valign="top">Router </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Width</td><td class="techvalue" width="100%" align="right" valign="top">30 cm </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Depth</td><td class="techvalue" width="100%" align="right" valign="top">19 cm </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Height</td><td class="techvalue" width="100%" align="right" valign="top">4.3 cm </td></tr><tr><td colspan="3"><img 
src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Weight</td><td class="techvalue" width="100%" align="right" valign="top">1.9 kg </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Expansion / Connectivity</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Interfaces</td><td class="techvalue" width="100%" align="right" valign="top">5 x network node - Ethernet 10Base-T/100Base-TX - RJ-45 ¦ 2 x network - Ethernet 10Base-T/100Base-TX - RJ-45 ( WAN ) ¦ 1 x management </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Manufacturer Warranty</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Service & Support Details</td><td class="techvalue" width="100%" align="right" valign="top">2 years warranty </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Service & Support</td><td class="techvalue" width="100%" align="right" valign="top">Limited warranty 
</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Networking</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Features</td><td class="techvalue" width="100%" align="right" valign="top">Firewall protection, NAT support, VPN, IGMP snooping, manageable, IP address filtering </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Switching Protocol</td><td class="techvalue" width="100%" align="right" valign="top">Ethernet </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Connectivity Technology</td><td class="techvalue" width="100%" align="right" valign="top">Wired </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Data Link Protocol</td><td class="techvalue" width="100%" align="right" valign="top">Ethernet, Fast Ethernet </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" 
height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Ports Qty</td><td class="techvalue" width="100%" align="right" valign="top">5-port switch </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Remote Management Protocol</td><td class="techvalue" width="100%" align="right" valign="top">SNMP 3, HTTP </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Routing Protocol</td><td class="techvalue" width="100%" align="right" valign="top">OSPF, RIP-1, RIP-2, IGMPv2, DVMRP, OSPFv2, PIM-SM, PIM-DM </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Network / Transport Protocol</td><td class="techvalue" width="100%" align="right" valign="top">AppleTalk, DECnet, UDP/IP, L2TP, VoIP, IP/IPX, IPSec </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Miscellaneous</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Compliant Standards</td><td class="techvalue" width="100%" 
align="right" valign="top">IEEE 802.1x </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Authentication Method</td><td class="techvalue" width="100%" align="right" valign="top">Secure Shell (SSH), RADIUS, PAP, CHAP, TACACS </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Encryption Algorithm</td><td class="techvalue" width="100%" align="right" valign="top">DES, Triple DES, SHA, AES, SSL </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Power</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Power Device</td><td class="techvalue" width="100%" align="right" valign="top">Power supply </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr class="techspec b"><td height="17"></td><td colspan="2" style="background-color: rgb(43, 120, 185); color: white;" valign="top"> Environmental Parameters</td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Min Operating Temperature</td><td class="techvalue" width="100%" align="right" valign="top">0 °C </td></tr><tr><td colspan="3"><img 
src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr style="background-color: rgb(233, 233, 233);"><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Max Operating Temperature</td><td class="techvalue" width="100%" align="right" valign="top">40 °C </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr><tr><td height="17"></td><td class="techspec" align="left" nowrap="true" valign="top"> Humidity Range Operating</td><td class="techvalue" width="100%" align="right" valign="top">5 - 80% </td></tr><tr><td colspan="3"><img src="/images/main/pixel-clr.gif" width="1" height="1"></td></tr></tbody> ?>

[attachment deleted by admin]
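One thing that may be worth checking with the long absolute path in the first post: paths copied from a browser inspector usually contain tbody steps, because browsers add that element to the DOM even when the page source does not have it. As far as I know, DOMDocument::loadHTML (libxml2) does not add it, so such a path can come back empty. A minimal sketch, assuming a made-up one-row table rather than the real page:

```php
<?php
// Hypothetical markup: a table written without <tbody>, as in many raw page sources.
$html = '<table><tr><td class="techvalue">Router</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Path as a browser inspector would show it (with tbody).
$with_tbody = $xpath->query('//table/tbody/tr/td')->length;

// Path that follows the raw source (no tbody).
$without_tbody = $xpath->query('//table/tr/td')->length;

// Path that does not care either way: any td with the class, at any depth.
$tolerant = $xpath->query('//table//td[@class="techvalue"]')->length;

var_dump($with_tbody, $without_tbody, $tolerant);
```

Selecting by class with a descendant step, as in the last query, avoids depending on the exact nesting of unnamed tables.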
jk2010 Posted January 4, 2010

I've made some changes to the XPath but still no luck. I would appreciate your help.
jk2010 Posted January 6, 2010

Still cannot sort this out. I'm splitting the specification table in two: one part is the tech spec type (name) and the other is the actual value.

<?php
$left = $xpath->query('//td[@class="techspec"]', $node)->item(0);
$right = $xpath->query('//td[@class="techvalue"]', $node)->item(0);
?>

I think the problem is in the main part, where I can't seem to find the correct XPath for the specification table, since it has no name and is wrapped in a table inside another table, and so on. Any ideas?
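A side note on the two queries above: in XPath, an expression starting with // is evaluated from the document root even when a context node is passed as the second argument, so both calls return the first matching cell of the whole page on every iteration. Starting the expression with .// keeps the search inside the context node. A small sketch, assuming a made-up two-row table in place of the real page:

```php
<?php
// Hypothetical two-row table standing in for the real specification table.
$html = '<table>'
      . '<tr><td class="techspec">Width</td><td class="techvalue">30 cm</td></tr>'
      . '<tr><td class="techspec">Depth</td><td class="techvalue">19 cm</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the second row as the context node.
$second = $xpath->query('//tr')->item(1);

// Absolute: ignores the context node and searches the whole document.
$absolute = $xpath->query('//td[@class="techspec"]', $second)->item(0)->nodeValue;

// Relative: searches only inside the second row.
$relative = $xpath->query('.//td[@class="techspec"]', $second)->item(0)->nodeValue;

var_dump($absolute, $relative);
```

Here the absolute query gives the cell from the first row even though the second row is the context, while the relative query gives the cell from the second row.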
jk2010 Posted January 8, 2010

I gave up and simply switched to a different test site. :-\ I'm really disappointed, and surprised, that no one helped here.
salathe Posted January 8, 2010

Your attached HTML file and the sample HTML do not correlate, and your PHP code doesn't appear to match the structure of either of them. That said, I took the source code from this page because that looks to be more useful. The following snippet is then able to extract the techspec/value pairs from the HTML table (your HTML/PHP might be different, so this may not be a simple copy/paste).

// Get product specification table rows
$prod_spec_nodes = $xpath->query('//tr[@class="techspec b"]/parent::table/tr[td[2][starts-with(@class, "techspec")]]');

// Note: $group is never updated in this snippet, so it stays empty.
$group = '';
foreach ($prod_spec_nodes as $node) {
    // The leading dot keeps each query relative to the current row.
    $spec = $xpath->query('.//td[@class="techspec"]', $node);
    $value = $xpath->query('.//td[@class="techvalue"]', $node);
    if ($spec->length == 1 && $value->length == 1) {
        $spec = trim($spec->item(0)->nodeValue);
        $value = trim($value->item(0)->nodeValue);
        add_specs(array('productid' => 12345, 'group' => $group, 'left' => $spec, 'right' => $value));
    }
}

function add_specs($values) {
    var_dump($values);
}

[ot] Part of the "no-one helping" could well have been due to the mismatching source, or due to the nature of the question (perhaps people run away when they see "XPath"), or, as is increasingly the case for me, they view the thread once and intend to come back and have a proper look at the problem, but the thread disappears from the list when they next view the forum. [/ot]
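The snippet above leaves the group empty for every pair. One possible way to fill it in, sketched here under the assumption that the heading rows are the ones with class "techspec b" (as in the sample HTML from the first post), is to walk all rows of the table in order and remember the most recent heading. The markup and the $specs array below are made up for illustration:

```php
<?php
// Hypothetical fragment modelled on the sample rows in the first post.
$html = '<table>'
      . '<tr class="techspec b"><td></td><td colspan="2"> General</td></tr>'
      . '<tr><td></td><td class="techspec"> Width</td><td class="techvalue">30 cm </td></tr>'
      . '<tr class="techspec b"><td></td><td colspan="2"> Power</td></tr>'
      . '<tr><td></td><td class="techspec"> Power Device</td><td class="techvalue">Power supply </td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$specs = array();
$group = '';

// All rows of the table that contains a heading row, in document order.
foreach ($xpath->query('//tr[@class="techspec b"]/parent::table/tr') as $row) {
    // Heading row: remember its text as the current group and move on.
    if ($row->getAttribute('class') === 'techspec b') {
        $group = trim($row->textContent);
        continue;
    }

    // Data row: needs exactly one name cell and one value cell.
    $spec = $xpath->query('./td[@class="techspec"]', $row);
    $value = $xpath->query('./td[@class="techvalue"]', $row);
    if ($spec->length === 1 && $value->length === 1) {
        $specs[] = array(
            'group' => $group,
            'left'  => trim($spec->item(0)->nodeValue),
            'right' => trim($value->item(0)->nodeValue),
        );
    }
}

var_dump($specs);
```

Spacer rows (the ones holding only the pixel image) have neither cell class, so they fall through both checks and are skipped.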