
Parse HTML With XPath


weep


Hey guys,

 

Can't seem to wrap my head around this. This is what I have:

 

$husdjur = new DOMDocument();
@$husdjur->loadHTML("test.html");
$xpath = new DOMXPath($husdjur);
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');
print_r($tableRows);

 

And this is what I get:

 

DOMNodeList Object ( )

 

Here is a sample of test.html (the file is massive; in this case I am going after the "5166" entry):

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- saved from url=(0077)https://xxxxxxxxxxx.net/api/excel/usagequantities?period=300d&format=html -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt } TABLE.responsedata,TABLE.responsedata TD { border: *snip*</style>
</head>
<body>
<table class="responsedata">
<thead>
<tr>
<th>Ärendenr</th>
<th>Status</th>
<th>Ärende skapat datum</th>
<th>Skapad av</th>
<th>Ändrad</th>
<th>Ändrad av</th>
And so on, 50 something more...
</tr>
</thead>
<tbody>
<tr>
<td>5166</td>
<td>Avslutad</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name1</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2012-10-08 10:27</td>
<td>Name2</td>
<td>K8 norr städ</td>
And so on, 50 something more...

 

Any help much appreciated, cheers!


After looking at your code I noticed a few things:

 

1) Use the method loadHTMLFile(), not loadHTML(). loadHTMLFile() reads the HTML from a file; loadHTML() parses the string you pass in, so it treated "test.html" as the HTML itself.

2) Turn on error reporting while you are debugging. The @ in front of the load call was also hiding any warnings.

3) You should declare your namespace; in this case it's the one set by xmlns on the <html> element.

 

Try:

 

<?php

// Report all PHP errors; -1 sets every bit, so one call is enough
error_reporting(-1);

$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");
$tableRows = $xpath->query('/html/body/table/tbody/tr[1]/td[1]');

foreach ($tableRows as $result) {
    echo $result->nodeValue;
    echo "\n";
}

?>
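
To see why point 1 matters, here is a minimal sketch that feeds loadHTML() first the file name and then real markup. The inline $markup string is a cut-down stand-in for test.html (an assumption for illustration, not the real file), and libxml_use_internal_errors() is only there to keep parser warnings quiet.

```php
<?php
// Cut-down stand-in for test.html (assumption: the real file has many more cells).
$markup = '<html><body><table><tbody><tr><td>5166</td></tr></tbody></table></body></html>';
$path   = '/html/body/table/tbody/tr[1]/td[1]';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Case 1: loadHTML() parses the string "test.html" as if it were the document.
$fromName = new DOMDocument();
$fromName->loadHTML('test.html');
$countFromName = (new DOMXPath($fromName))->query($path)->length;

// Case 2: loadHTML() is given actual markup, so the table exists in the DOM.
$fromMarkup = new DOMDocument();
$fromMarkup->loadHTML($markup);
$countFromMarkup = (new DOMXPath($fromMarkup))->query($path)->length;

echo "matches from file name: $countFromName\n";
echo "matches from markup: $countFromMarkup\n";
```

The first count is what produced the empty DOMNodeList in the original post. With a file on disk, loadHTMLFile("test.html") plays the role of case 2.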

Edited by Maq

I think you have the wrong indexes on the tr and td too. Starts counting at zero.

 

Also, your expression isn't doing what you think it's doing. [X] is not an offset, it's a condition. Try this more correct and more powerful version which goes directly to the cell you want without the guesswork of where it is:

//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]

 

[edit] Also, the <meta> doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

Edited by requinix

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.
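
A quick way to check both points is to run the expressions against a tiny table. This is only a sketch: the one-row $markup below is a trimmed copy of the sample row from the first post, not the real file.

```php
<?php
// Trimmed copy of the first sample row (assumption: the real rows have 50+ cells).
$markup = '<table class="responsedata"><tbody>'
        . '<tr><td>5166</td><td>Avslutad</td><td>Name1</td></tr>'
        . '</tbody></table>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// [1] is shorthand for [position()=1], so it selects the first cell.
$first = $xpath->query('//tr[1]/td[1]')->item(0)->nodeValue;

// No node ever has position 0, so this matches nothing.
$zeroCount = $xpath->query('//tr[1]/td[0]')->length;

// The following-sibling expression lands on the cell right after "5166".
$sibling = $xpath->query(
    "//table[@class='responsedata']//td[text()='5166']/following-sibling::td[position()=1]"
)->item(0)->nodeValue;

echo "td[1]: $first\n";
echo "td[0] matches: $zeroCount\n";
echo "following sibling: $sibling\n";
```

td[1] gives 5166, td[0] matches nothing, and the following-sibling version gives Avslutad rather than the 5166 cell itself.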

 

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

 

Also, the <meta> doesn't need to be closed. The parser is smart enough to know that it's automatically closed.

Good to know.

Edited by Maq

No, XPath indexing starts at 1.

...Hmm. Okay. Wonder what I was thinking of. I even contradicted myself with the position()=1.

 

Also, your expression matches on <td>Avslutad</td>.

I misread the question and thought the problem was finding the username. "after the 5166".

Edited by requinix

Sorry for the delay :sweat:

 

Sweet, plenty of awesome tips to try. I will poke around for a bit and return with a solution/result/more questions.

 

No, XPath indexing starts at 1. Also, your expression matches on <td>Avslutad</td>.

 

Weep, if you tell us what exactly you're trying to match on, we can give you the best XPath solution.

 

I want to grab every cell within every <tr>, see picture:

 

11553699.jpg


I want to grab every cell within every <tr>

 

Then you likely want to get all of the rows, loop over them and access each row's individual collection of cells.  The basic idea is something like:

 

$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}
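
If you would rather end up with the data in arrays than echo it, one possible variation is to read the <th> texts first and use them as keys for each row. This is a sketch under assumptions: the two-column $markup is a reduced stand-in for the real table, and $headers, $values and $rows are names made up for the example.

```php
<?php
// Reduced stand-in for the real table (assumption: two of the 50+ columns).
$markup = '<table class="responsedata">'
        . '<thead><tr><th>Status</th><th>Skapad av</th></tr></thead>'
        . '<tbody><tr><td>Avslutad</td><td>Name1</td></tr></tbody>'
        . '</table>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Header texts, in document order.
$headers = [];
foreach ($xpath->query('//table/thead/tr/th') as $th) {
    $headers[] = trim($th->nodeValue);
}

// One associative array per body row, keyed by header text.
$rows = [];
foreach ($xpath->query('//table/tbody/tr') as $tr) {
    $values = [];
    // The relative path 'td' is evaluated against the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $values[] = trim($td->nodeValue);
    }
    // array_combine() needs equal lengths; skip rows that do not line up.
    if (count($values) === count($headers)) {
        $rows[] = array_combine($headers, $values);
    }
}

print_r($rows);
```

With the real file you would call loadHTMLFile("test.html") instead, and you may need to handle the character encoding of the non-ASCII headers.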


Thank you for all your help guys! :happy-04: Solution for this thread:

 

// Report all PHP errors; -1 sets every bit, so one call is enough
error_reporting(-1);
$husdjur = new DOMDocument();
$husdjur->loadHTMLFile("test.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");

$tableRows = $xpath->query('/html/body/table/tbody/tr');
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);
    foreach ($cells as $cell) {
        echo $cell->getNodePath();
        echo ' has value ';
        var_export($cell->nodeValue);
        echo "<br>\n";
    }
}

