Parse Html With Xpath

weep · December 3, 2012

Hey guys,

Can't seem to wrap my head around this. This is what I have:

$husdjur = new DOMDocument();
@$husdjur->loadHTML("test.html");
$xpath = new DOMXPath($husdjur);
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');
print_r($tableRows);

And this is what I get:

DOMNodeList Object ( )

Here is a sample of test.html (the file is massive; in this case I am after the "5166" entry):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- saved from url=(0077)https://xxxxxxxxxxx.net/api/excel/usagequantities?period=300d&format=html -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt } TABLE.responsedata,TABLE.responsedata TD { border: *snip*</style>
</head>
<body>
<table class="responsedata">
<thead>
<tr>
<th>Ärendenr</th>
<th>Status</th>
<th>Ärende skapat datum</th>
<th>Skapad av</th>
<th>Ändrad</th>
<th>Ändrad av</th>
And so on, 50 something more...
</tr>
</thead>
<tbody>
<tr>
<td>5166</td>
<td>Avslutad</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name1</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name2</td>
<td>K8 norr städ</td>
And so on, 50 something more...

Any help much appreciated, cheers!

Maq · December 3, 2012

First, try closing the <meta> tag.

Maq · December 3, 2012

After looking at your code I noticed a few things:

1) You should be using the method loadHTMLFile(), not loadHTML(). loadHTMLFile() loads HTML from a file; loadHTML(), which you were using, treated the string "test.html" as the HTML itself.

2) Turn on error reporting when you are debugging.

3) You should be registering your namespace, in this case the one declared in the xmlns attribute.

Try:

<?php

// Report all PHP errors (-1 is equivalent to E_ALL, so one call is enough)
error_reporting(E_ALL);

$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');

foreach ($tableRows as $result) {
    echo $result->nodeValue;
    echo "\n";
}

?>
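
To see the difference between the two loaders without needing the real file, here is a minimal sketch. The inline `$html` string is a hypothetical stand-in for test.html, cut down to one row. It also leaves out registerNamespace(): as far as I know, the HTML parser behind loadHTML()/loadHTMLFile() does not put elements into the XHTML namespace, so the unprefixed path should match either way.

```php
<?php
// Hypothetical inline sample standing in for test.html (one row only).
$html = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
      . '<table class="responsedata"><tbody>'
      . '<tr><td>5166</td><td>Avslutad</td></tr>'
      . '</tbody></table></body></html>';

$query = '/html/body/table/tbody/tr[1]/td[1]';

// loadHTML() parses its argument as markup, so a file name is just text.
$wrong = new DOMDocument();
@$wrong->loadHTML('test.html');
$wrongHits = (new DOMXPath($wrong))->query($query);

// Given real markup, the same query finds the cell.
$right = new DOMDocument();
$right->loadHTML($html);
$rightHits = (new DOMXPath($right))->query($query);

echo 'file name as markup: ', $wrongHits->length, " match(es)\n";
echo 'real markup: ', $rightHits->item(0)->nodeValue, "\n";
```

With a file on disk, loadHTMLFile("test.html") takes the place of the second loadHTML() call.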

requinix · December 3, 2012

I think you have the wrong indexes on the tr and td too. Starts counting at zero.

Also, your expression isn't doing what you think it's doing. [X] is not an offset, it's a condition. Try this more correct and more powerful version which goes directly to the cell you want without the guesswork of where it is:

//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]

[edit] Also, the <meta> tag doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

Maq · December 3, 2012

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

Quote
Also, the <meta> tag doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

Good to know.

requinix · December 3, 2012

On 12/3/2012 at 10:22 PM, Maq said:

No, XPath indexing starts at 1.

...Hmm. Okay. Wonder what I was thinking of. I even contradicted myself with the position()=1.

On 12/3/2012 at 10:22 PM, Maq said:

Also, your expression matches on <td>Avslutad</td>.

I misread the question and thought the problem was finding the username. "after the 5166".
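
Both points settled above can be checked with a small sketch: a numeric predicate such as [1] is shorthand for [position()=1], counted from 1, and the following-sibling expression lands on the cell after "5166". The two-row `$html` sample is hypothetical; the second row (5167 / Ny) is made up for illustration.

```php
<?php
// Hypothetical two-row sample; only the first row mirrors the thread's data.
$html = '<html><body><table class="responsedata"><tbody>'
      . '<tr><td>5166</td><td>Avslutad</td></tr>'
      . '<tr><td>5167</td><td>Ny</td></tr>'
      . '</tbody></table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// A numeric predicate is shorthand for a position() test, counted from 1.
$short = $xpath->query('//tr[1]/td[1]')->item(0)->nodeValue;
$long  = $xpath->query('//tr[position()=1]/td[position()=1]')->item(0)->nodeValue;

// Select a cell by its text, then step to its next sibling cell.
$next = $xpath->query(
    "//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]"
)->item(0)->nodeValue;

echo $short, ' ', $long, ' ', $next, "\n";
```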

weep · December 4, 2012

Sorry for the delay

Sweet, plenty of awesome tips to try. I will poke around for a bit and return with a solution/result/more questions.

On 12/3/2012 at 10:22 PM, Maq said:

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

I want to grab every cell within every <tr>, see picture:

salathe · December 4, 2012

On 12/4/2012 at 7:25 AM, weep said:
I want to grab every cell within every <tr>

Then you likely want to get all of the rows, loop over them and access each row's individual collection of cells. The basic idea is something like:

$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}
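
If the goal is to work with the values rather than print them, the same nested loop can fill an array of rows. This is only a sketch under assumptions: the inline `$html` is a hypothetical stand-in for test.html, and `$rows` and `$cells` are names chosen for illustration.

```php
<?php
// Hypothetical stand-in for test.html; the second row is made up.
$html = '<html><body><table class="responsedata"><tbody>'
      . '<tr><td>5166</td><td>Avslutad</td></tr>'
      . '<tr><td>5167</td><td>Ny</td></tr>'
      . '</tbody></table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('/html/body/table/tbody/tr') as $tr) {
    $cells = [];
    // The second argument makes the 'td' path relative to the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Each entry of `$rows` is then one table row, with its cells in document order.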

weep · December 4, 2012

Thank you for all your help guys! Solution for this thread:

// Report all PHP errors (-1 is equivalent to E_ALL, so one call is enough)
error_reporting(E_ALL);
$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");

$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}
