
Extract specific text from html elements using Xpath - help needed


jk2010


hi guys,

 

How can I use preg_match, or any other condition, when trying to extract information from an HTML page? I've tried preg_match, but I get an error because I'm using it inside an XPath query.

 

This is my query code:

 

$prod_quicklinx_node = $xpath->query('//table[@class=maintbl]/descendant::div/td[@class=textblacksmblue]');

$prod_mfr_node = $xpath->query('//table[@class=maintbl]/descendant::td[@class=textblacksm] and not(contains(@*, "Manufacturer"))');

 

I don't get any results.

 

 

SOURCE CODE OF HTML ENTITIES (i only need the BOLD items, code, mfr name and mfr#)

                    <tbody><tr> 
                        <td align="left" nowrap="nowrap">
                            
                                <a href="javascript:MM_openBrWindow('http://img.misco.co.uk/images/uploadedimages/large/20091027141548.jpg','LargeImage','scrollbars=no,width=350,height=350')" class="details"> 
                                <img src="/images/itemdetails/icon-enlarge.gif" alt="" align="absmiddle" border="0" hspace="0"></a>
                                <!--<img src="http://img3.misco.co.uk/images/misc/pixel-clr.gif" width="2" height="1" alt="">-->
                            
                                <a href="/applications/email/emailafriend.asp"> <img src="/images/itemdetails/icon-email.gif" alt="" align="absmiddle" border="0" hspace="6"></a><a href="http://www.misco.co.uk/applications/SearchTools/item-details-print.asp?EdpNo=336830&Sku=Q151273"><img src="/images/itemdetails/icon-print.gif" alt="" align="absmiddle" border="0" hspace="4"></a> 
                            
                        </td>
                        <td width="44" align="right" valign="top"><img src="http://img1.misco.co.uk/images/itemdetails/itemtitle_yellowleft.gif" alt="" width="44" height="24"></td>
                        <td style="background-image: url(http://img.misco.co.uk/images/itemdetails/itemtitle_yellow_bg.gif); background-repeat: repeat-x;" class="textblackmed" width="340" valign="middle">
                            <table width="100%" border="0" cellpadding="0" cellspacing="0" height="18">
                            <tbody><tr valign="top"> 
                                <td class="textblacksm" width="35" nowrap="nowrap">Misco No: </td>
                                
                                    [b]<td class="textblacksmblue" width="40%"><b>Q151273</b></td>[/b]
                                
                                <td align="right" nowrap="nowrap">
                                   <table border="0" cellpadding="3" cellspacing="0">
                                   <tbody><tr valign="top">                                   
                                     <td><div style="position: relative; top: -3px;"><a href="javascript:void(0);" onclick="postReview();" alt="Add Review" style="font-size: 12px;">
						                 <img src="/images/itemdetails/ADD_REVI.GIF" alt="Add Review" border="0">
						                 </a></div></td></tr></tbody></table></td></tr>
                                <!--</td>
                            </tr>-->
                            </tbody></table>
                        </td>
                        <td width="18" align="right" valign="top"><img src="http://img3.misco.co.uk/images/itemdetails/itemtitle_yellowright1.gif" alt="" width="18" height="24"></td>
                    </tr>
                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left">
                    </td>
                    <td></td>                        
                    </tr>
                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left">
                    <table>
                    <tbody><tr><td class="textblacksm" width="110">Manufacturer:</td>[b]<td class="textblacksm" nowrap="nowrap"> <strong>Canon </strong> </td>[/b]</tr>
                    <tr><td class="textblacksm" width="110">Manufacturer Part No:</td>[b]<td class="textblacksm" nowrap="nowrap"> <strong>2925B008AA </strong> </td>[/b]
                    </tr>
                    </tbody></table>
                    </td>
                    <td></td>                        
                    </tr>
      
                    </tbody>

 

Thanks for any help you can provide.

 

cheers

 

jari

One way would be to grab the table which wraps all of your required information, then use a couple of XPath queries looking for the specific details within that wrapping table. For example:

 

$table = $xpath->query('//table[4]')->item(0);
$code  = $xpath->query("//td[starts-with(text(), 'Misco')]/following-sibling::td/b", $table)->item(0)->nodeValue;
$man   = rtrim($xpath->query("//td[.='Manufacturer:']/following-sibling::td/strong", $table)->item(0)->nodeValue);
$part  = rtrim($xpath->query("//td[.='Manufacturer Part No:']/following-sibling::td/strong", $table)->item(0)->nodeValue);

var_dump($code, $man, $part);
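
For anyone who wants to try this in isolation, here is a minimal, self-contained sketch of the same idea. The markup is a trimmed, hypothetical version of the page from the question (the `class="maintbl"` attribute on the table is taken from the question's query, not from the posted HTML). Two details differ from the snippet above and are worth knowing: attribute values in an XPath predicate need quotes (`@class="maintbl"`, not `@class=maintbl`), and an expression that starts with `//` always searches from the document root even when a context node is passed, so `.//` is used here to stay inside the table.

```php
<?php
// Trimmed, hypothetical markup based on the HTML posted in the question.
$html = <<<HTML
<table class="maintbl">
  <tr>
    <td class="textblacksm">Misco No: </td>
    <td class="textblacksmblue"><b>Q151273</b></td>
  </tr>
  <tr>
    <td class="textblacksm">Manufacturer:</td>
    <td class="textblacksm"> <strong>Canon </strong> </td>
  </tr>
  <tr>
    <td class="textblacksm">Manufacturer Part No:</td>
    <td class="textblacksm"> <strong>2925B008AA </strong> </td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Quoted attribute value; an unquoted maintbl would be read as a child element name.
$table = $xpath->query('//table[@class="maintbl"]')->item(0);

// ".//" keeps each search relative to $table instead of the whole document.
$code = $xpath->query(".//td[starts-with(normalize-space(.), 'Misco')]/following-sibling::td/b", $table)->item(0)->nodeValue;
$man  = trim($xpath->query(".//td[.='Manufacturer:']/following-sibling::td/strong", $table)->item(0)->nodeValue);
$part = trim($xpath->query(".//td[.='Manufacturer Part No:']/following-sibling::td/strong", $table)->item(0)->nodeValue);

var_dump($code, $man, $part);
```

On a real page, check that each `query()` call actually returned a node before calling `->item(0)->nodeValue`, since `item(0)` is `null` when nothing matches.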


 

Hi Salathe,

 

thank you for the reply mate.

 

I'll give this a go but one problem i have is that the wrapping TABLE does not have any name, class or id and also all the three elements are in a "td" that has the same class which is <td class="textblacksm">.

 

would this make any difference? i'm going to try ur idea first anyway. cheers
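
The shared class should not matter when the cells are picked out by the label text in the neighbouring cell, as the queries above do. If a pattern test is really wanted, XPath 1.0 has no regular-expression support of its own, but PHP's DOMXPath can expose PHP functions to a query through `registerPhpFunctions()`. The following is only a sketch: the markup is a cut-down, hypothetical sample, and the pattern `/^Q[0-9]+$/` is an assumption about what a Misco code looks like, based on the single example in the question.

```php
<?php
// Cut-down, hypothetical sample of the markup from the question.
$html = '<table>'
      . '<tr><td class="textblacksm">Misco No: </td><td class="textblacksmblue"><b>Q151273</b></td></tr>'
      . '<tr><td class="textblacksm">Manufacturer:</td><td class="textblacksm"><strong>Canon </strong></td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Make PHP functions callable from XPath under the "php" prefix,
// and allow only preg_match to be called.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');

// php:functionString passes the node's string value to preg_match.
// The pattern is an assumed shape for the code: "Q" followed by digits.
$nodes = $xpath->query('//td[php:functionString("preg_match", "/^Q[0-9]+$/", .) = 1]');

foreach ($nodes as $node) {
    echo $node->nodeValue, "\n";
}
```

A simpler alternative is to select the candidate cells with a plain XPath query and run `preg_match()` on each `nodeValue` in an ordinary PHP loop.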

 

 

hi salathe, thanks for your help, got it working now.

 

just one thing: on the price side the result looks a bit funny, how can i clean it up?

 

 

[Price] => £11.74 inc VAT         Â

 

thanks a lot.

 

jari

That looks like it's to do with character encoding. I'd suggest you set the encoding of the page it's being output on to UTF-8 either through the HTML in the <head> section with meta tags or with the header function.
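
As a rough sketch of that suggestion: send a UTF-8 charset header before any output, and optionally tidy the scraped string as well. The `$raw` value below is a hypothetical stand-in for the scraped price, assuming (not confirmed in this thread) that the stray character comes from a non-breaking space at the end of the cell; the script itself must be saved as UTF-8.

```php
<?php
// Tell the browser the page is UTF-8 (must run before any output is sent).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical scraped value: trailing spaces plus a UTF-8 non-breaking space.
$raw = "£11.74 inc VAT         \xC2\xA0";

// Collapse runs of whitespace (including U+00A0) to one space, then trim the ends.
$clean = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $raw));

echo $clean, "\n";
```

The header approach and a `<meta>` charset tag in the `<head>` do the same job; the cleanup step is an extra that only matters if the padding is unwanted in the stored value.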

 

Edit: salathe beat me to it.

Only within 10 minutes of posting them. Mine was technically a 'Faux Edit' (as you can tell by the fact it doesn't have an edit time at the bottom). I clicked post and it warned me salathe had posted so I stuck the disclaimer in before hitting submit again.
