Extract specific text from html elements using Xpath - help needed


jk2010

Hi guys,

How can I use preg_match, or any other condition, when extracting information from an HTML page? I tried preg_match, but I get an error because I'm calling it inside an XPath query.

 

This is my query code:

 

$prod_quicklinx_node = $xpath->query('//table[@class=maintbl]/descendant::div/td[@class=textblacksmblue]');

$prod_mfr_node = $xpath->query('//table[@class=maintbl]/descendant::td[@class=textblacksm] and not(contains(@*, "Manufacturer"))');

 

I don't get any results.

 

 

HTML source of the page (I only need the items marked in bold: the code, the manufacturer name and the manufacturer part number):

                    <tbody><tr> 
                        <td align="left" nowrap="nowrap">
                            
                                <a href="javascript:MM_openBrWindow('http://img.misco.co.uk/images/uploadedimages/large/20091027141548.jpg','LargeImage','scrollbars=no,width=350,height=350')" class="details"> 
                                <img src="/images/itemdetails/icon-enlarge.gif" alt="" align="absmiddle" border="0" hspace="0"></a>
                                <!--<img src="http://img3.misco.co.uk/images/misc/pixel-clr.gif" width="2" height="1" alt="">-->
                            
                                <a href="/applications/email/emailafriend.asp"> <img src="/images/itemdetails/icon-email.gif" alt="" align="absmiddle" border="0" hspace="6"></a><a href="http://www.misco.co.uk/applications/SearchTools/item-details-print.asp?EdpNo=336830&Sku=Q151273"><img src="/images/itemdetails/icon-print.gif" alt="" align="absmiddle" border="0" hspace="4"></a> 
                            
                        </td>
                        <td width="44" align="right" valign="top"><img src="http://img1.misco.co.uk/images/itemdetails/itemtitle_yellowleft.gif" alt="" width="44" height="24"></td>
                        <td style="background-image: url(http://img.misco.co.uk/images/itemdetails/itemtitle_yellow_bg.gif); background-repeat: repeat-x;" class="textblackmed" width="340" valign="middle">
                            <table width="100%" border="0" cellpadding="0" cellspacing="0" height="18">
                            <tbody><tr valign="top"> 
                                <td class="textblacksm" width="35" nowrap="nowrap">Misco No: </td>
                                
                                    [b]<td class="textblacksmblue" width="40%"><b>Q151273</b></td>[/b]
                                
                                <td align="right" nowrap="nowrap">
                                   <table border="0" cellpadding="3" cellspacing="0">
                                   <tbody><tr valign="top">                                   
                                     <td><div style="position: relative; top: -3px;"><a href="javascript:void(0);" onclick="postReview();" alt="Add Review" style="font-size: 12px;">
						                 <img src="/images/itemdetails/ADD_REVI.GIF" alt="Add Review" border="0">
						                 </a></div></td></tr></tbody></table></td></tr>
                                <!--</td>
                            </tr>-->
                            </tbody></table>
                        </td>
                        <td width="18" align="right" valign="top"><img src="http://img3.misco.co.uk/images/itemdetails/itemtitle_yellowright1.gif" alt="" width="18" height="24"></td>
                    </tr>
                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left">
                    </td>
                    <td></td>                        
                    </tr>
                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left">
                    <table>
                    <tbody><tr><td class="textblacksm" width="110">Manufacturer:</td>[b]<td class="textblacksm" nowrap="nowrap"> <strong>Canon </strong> </td>[/b]</tr>
                    <tr><td class="textblacksm" width="110">Manufacturer Part No:</td>[b]<td class="textblacksm" nowrap="nowrap"> <strong>2925B008AA </strong> </td>[/b]
                    </tr>
                    </tbody></table>
                    </td>
                    <td></td>                        
                    </tr>
      
                    </tbody>

 

Thanks for any help you can provide.

 

cheers

 

jari

One way would be to grab the table that wraps all of the information you need, then run a few XPath queries for the specific details within that wrapping table. For example:

 

$table = $xpath->query('//table[4]')->item(0);
// The leading "." makes each query relative to $table; a bare "//" would search the whole document.
$code  = $xpath->query(".//td[starts-with(text(), 'Misco')]/following-sibling::td/b", $table)->item(0)->nodeValue;
$man   = rtrim($xpath->query(".//td[.='Manufacturer:']/following-sibling::td/strong", $table)->item(0)->nodeValue);
$part  = rtrim($xpath->query(".//td[.='Manufacturer Part No:']/following-sibling::td/strong", $table)->item(0)->nodeValue);

var_dump($code, $man, $part);
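
To try the idea without the live page, here is a minimal self-contained sketch. The markup is a trimmed stand-in for the HTML in the question, and valueAfterLabel() is a hypothetical helper name, not part of any library. It assumes each value sits in the cell immediately after its label.

```php
<?php
// Hypothetical sample markup, trimmed from the page source in the question.
$html = <<<HTML
<table>
  <tr>
    <td class="textblacksm">Misco No: </td>
    <td class="textblacksmblue"><b>Q151273</b></td>
  </tr>
  <tr>
    <td class="textblacksm">Manufacturer:</td>
    <td class="textblacksm"> <strong>Canon </strong> </td>
  </tr>
  <tr>
    <td class="textblacksm">Manufacturer Part No:</td>
    <td class="textblacksm"> <strong>2925B008AA </strong> </td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$table = $xpath->query('//table')->item(0);

// Hypothetical helper: return the trimmed text of the cell that follows a label cell.
function valueAfterLabel(DOMXPath $xpath, DOMNode $context, string $label): ?string
{
    // The leading "." keeps the search inside $context.
    // $label is assumed to contain no double quotes.
    $query = sprintf('.//td[starts-with(normalize-space(.), "%s")]/following-sibling::td[1]', $label);
    $node  = $xpath->query($query, $context)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

$code = valueAfterLabel($xpath, $table, 'Misco No');
$man  = valueAfterLabel($xpath, $table, 'Manufacturer:');
$part = valueAfterLabel($xpath, $table, 'Manufacturer Part No:');

var_dump($code, $man, $part);
```

Matching on the label text rather than on position or class means the lookup should keep working if the surrounding layout tables change, as long as the labels stay the same.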

Hi Salathe,

Thank you for the reply, mate.

I'll give this a go, but one problem I have is that the wrapping table does not have a name, class or id, and all three elements are in a "td" with the same class, <td class="textblacksm">.

Would this make any difference? I'm going to try your idea first anyway. Cheers.
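
On the preg_match part of the question: DOMXPath implements XPath 1.0, which has no regular-expression functions, so one option is to keep the query broad and run preg_match over each node's text in PHP. The sketch below assumes simplified markup and a made-up pattern for the part number format. Note also that the class value is quoted in the query; an unquoted @class=maintbl is compared against a child element named maintbl rather than the string, so it would likely never match.

```php
<?php
// Simplified stand-in for the page: several cells share the same class.
$html = '<table>'
      . '<tr><td class="textblacksm">Manufacturer:</td>'
      . '<td class="textblacksm"> <strong>Canon </strong> </td></tr>'
      . '<tr><td class="textblacksm">Manufacturer Part No:</td>'
      . '<td class="textblacksm"> <strong>2925B008AA </strong> </td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Broad query: every cell with the shared class (attribute value quoted).
$cells = $xpath->query('//td[@class="textblacksm"]');

// Assumed pattern: 4 digits, a letter, 3 digits, 2 letters (e.g. 2925B008AA).
$partNumbers = [];
foreach ($cells as $cell) {
    $text = trim($cell->nodeValue);
    if (preg_match('/^\d{4}[A-Z]\d{3}[A-Z]{2}$/', $text)) {
        $partNumbers[] = $text;
    }
}

var_dump($partNumbers);
```

The filtering happens after the query returns, so the shared class and the missing id on the table do not get in the way.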

 

 

That looks like it's to do with character encoding. I'd suggest you set the encoding of the page it's being output on to UTF-8 either through the HTML in the <head> section with meta tags or with the header function.
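
A rough sketch of both options, assuming the output is a plain HTML page built by the same script. The sample value is made up for illustration; DOMDocument hands back node values as UTF-8 strings, so they can be escaped and echoed directly once the page declares UTF-8.

```php
<?php
// Option 1: send the charset in the HTTP header (must happen before any output).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Option 2: declare the charset in the <head> section with a meta tag.
$metaTag = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

// Hypothetical extracted value containing a non-ASCII character (e with acute accent).
$value = "Caf\u{00E9}";

// Escape with an explicit UTF-8 charset so multi-byte characters pass through intact.
$escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

echo "<html><head>{$metaTag}</head><body>{$escaped}</body></html>";
```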

 

Edit: salathe beat me to it.

Only within 10 minutes of posting them. Mine was technically a 'Faux Edit' (as you can tell by the fact it doesn't have an edit time at the bottom). I clicked post and it warned me salathe had posted so I stuck the disclaimer in before hitting submit again.
