
Parse Html With Xpath


weep


Hey guys,

 

Can't seem to wrap my head around this. This is what I have:

 

$husdjur = new DOMDocument();
@$husdjur->loadHTML("test.html");
$xpath = new DOMXPath($husdjur);
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');
print_r($tableRows);

 

And this is what I get:

 

DOMNodeList Object ( )

 

Here is a sample of test.html (I am after the "5166" entry; the full file is massive):

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- saved from url=(0077)https://xxxxxxxxxxx.net/api/excel/usagequantities?period=300d&format=html -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt } TABLE.responsedata,TABLE.responsedata TD { border: *snip*</style>
</head>
<body>
<table class="responsedata">
<thead>
<tr>
<th>Ärendenr</th>
<th>Status</th>
<th>Ärende skapat datum</th>
<th>Skapad av</th>
<th>Ändrad</th>
<th>Ändrad av</th>
And so on, 50 something more...
</tr>
</thead>
<tbody>
<tr>
<td>5166</td>
<td>Avslutad</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name1</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name2</td>
<td>K8 norr städ</td>
And so on, 50 something more...

 

Any help much appreciated, cheers!

https://forums.phpfreaks.com/topic/271536-parse-html-with-xpath/

After looking at your code I noticed a few things:

 

1) You should be using loadHTMLFile(), not loadHTML(). loadHTMLFile() loads HTML from a file; loadHTML() treats its argument as the markup itself, so it parsed the string "test.html" as if it were the HTML.

2) Turn on error reporting when you are debugging.

3) You should be declaring your namespace; in this case it's the one given by the xmlns attribute on <html>.
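To make point 1 concrete, here is a minimal sketch of the difference. The inline markup and the temporary file are stand-ins I made up for test.html, so treat this as an illustration under those assumptions rather than the exact setup from the question.

```php
<?php
// Keep parser notices out of the output; they are not the point of this sketch.
libxml_use_internal_errors(true);

// Hypothetical stand-in for test.html, written to a temporary file.
$html = '<html><body><table><tbody><tr><td>5166</td><td>Avslutad</td></tr></tbody></table></body></html>';
$path = tempnam(sys_get_temp_dir(), 'xp');
file_put_contents($path, $html);

// loadHTML() parses its argument as markup, so the path string becomes plain text.
$wrong = new DOMDocument();
$wrong->loadHTML($path);
$wrongCount = (new DOMXPath($wrong))->query('//td')->length;

// loadHTMLFile() opens the file and parses its contents.
$right = new DOMDocument();
$right->loadHTMLFile($path);
$rightCount = (new DOMXPath($right))->query('//td')->length;

unlink($path);

echo "loadHTML found $wrongCount cells, loadHTMLFile found $rightCount cells\n";
```

This would explain the empty DOMNodeList in the question: the document built by loadHTML() contains no table at all.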

 

Try:

 

<?php

// Report all PHP errors
error_reporting(E_ALL);

$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');

foreach ($tableRows as $result) {
    echo $result->nodeValue;
    echo "\n";
}

?>

I think you have the wrong indexes on the tr and td too. Starts counting at zero.

 

Also, your expression isn't doing what you think it's doing. [X] is not an offset, it's a condition. Try this more correct and more powerful version which goes directly to the cell you want without the guesswork of where it is:

//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]
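A small sketch of how predicates behave, in case it helps. The markup is a trimmed, made-up version of the sample row, and the variable names are only for illustration.

```php
<?php
// Trimmed stand-in for the sample table (assumption: one row, three cells).
$html = '<html><body><table class="responsedata"><tbody><tr>'
      . '<td>5166</td><td>Avslutad</td><td>Name1</td>'
      . '</tr></tbody></table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// [1] is shorthand for the condition [position()=1], and positions count from 1.
$short = $xpath->query('//tr[1]/td[1]')->item(0)->nodeValue;
$long  = $xpath->query('//tr[position()=1]/td[position()=1]')->item(0)->nodeValue;

// No node has position 0, so this condition never holds.
$zeroMatches = $xpath->query('//tr[1]/td[0]')->length;

// A predicate can also test content: find the cell by its text, then step to the next sibling.
$next = $xpath->query(
    "//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]"
)->item(0)->nodeValue;

echo "$short / $long / $zeroMatches / $next\n";
```

Note that the text-matching expression selects the cell after the one containing 5166, not the 5166 cell itself.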

 

[edit] Also, the <meta> doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.

 

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

 

  Quote
Also, the <meta> doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

Good to know.

  On 12/3/2012 at 10:22 PM, Maq said:

No, XPath indexing starts at 1.

...Hmm. Okay. Wonder what I was thinking of. I even contradicted myself with the position()=1.

 

  On 12/3/2012 at 10:22 PM, Maq said:

Also, your expression matches on <td>Avslutad</td>.

I misread the question and thought the problem was finding the username. "after the 5166".

Sorry for the delay :sweat:

 

Sweet, plenty of awesome tips to try. I will poke around for a bit and return with a solution/result/more questions.

 

  On 12/3/2012 at 10:22 PM, Maq said:

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.

 

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

 

I want to grab every cell within every <tr>, see picture:

 

11553699.jpg

  On 12/4/2012 at 7:25 AM, weep said:
I want to grab every cell within every <tr>

 

Then you likely want to get all of the rows, loop over them and access each row's individual collection of cells.  The basic idea is something like:

 

$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}
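One detail worth spelling out is the second argument to query(): it is the context node, and only a relative path such as 'td' is evaluated against it. Below is a rough sketch of that, assuming a made-up two-row table (the second row's values are invented).

```php
<?php
// Hypothetical two-row sample; only the first row comes from the thread.
$html = '<html><body><table><tbody>'
      . '<tr><td>5166</td><td>Avslutad</td></tr>'
      . '<tr><td>5167</td><td>Ny</td></tr>'
      . '</tbody></table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$table = [];
$absoluteCounts = [];
foreach ($xpath->query('/html/body/table/tbody/tr') as $row) {
    // Relative path: only the <td> children of this particular row.
    $cells = [];
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = $cell->nodeValue;
    }
    $table[] = $cells;

    // A path starting with // begins at the document root, so the context row is ignored.
    $absoluteCounts[] = $xpath->query('//td', $row)->length;
}

var_export($table);
```

If you ever need descendants relative to the row rather than direct children, './/td' keeps the query anchored to the context node.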

Thank you for all your help guys! :happy-04: Solution for this thread:

 

// Report all PHP errors
error_reporting(E_ALL);

$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");

// Fetch every body row, then query each row's cells relative to that row
$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}
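For anyone who wants the cells as data rather than echoed output, one possible follow-up is to key each row by the header texts. This is only a sketch: the header names "Nr" and "Status" are shortened placeholders, not the real column names, and the unprefixed paths work here without any registerNamespace() call.

```php
<?php
// Shortened, hypothetical version of the table: two columns instead of fifty-odd.
$html = '<html><body><table class="responsedata">'
      . '<thead><tr><th>Nr</th><th>Status</th></tr></thead>'
      . '<tbody><tr><td>5166</td><td>Avslutad</td></tr></tbody>'
      . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the header texts once.
$headers = [];
foreach ($xpath->query('//table/thead/tr/th') as $th) {
    $headers[] = trim($th->nodeValue);
}

// Build one associative array per body row.
$records = [];
foreach ($xpath->query('//table/tbody/tr') as $row) {
    $values = [];
    foreach ($xpath->query('td', $row) as $td) {
        $values[] = trim($td->nodeValue);
    }
    // array_combine() needs equal lengths, so skip rows that do not line up.
    if (count($values) === count($headers)) {
        $records[] = array_combine($headers, $values);
    }
}

var_export($records);
```

Rows whose cell count differs from the header count are silently skipped in this sketch; you may prefer to log them instead.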
