
Parse Html With Xpath (Part 2)


weep


Hi guys,

 

Some time ago I asked for help with XPath and swiftly received it, thank you! Now I have almost the same problem. Here is where it all started:

 

http://forums.phpfre...h/#entry1397191

 

It worked perfectly for some time, until today. It seems the provider changed their source, and I cannot for the life of me find what the problem is. First off, Maq said in the previous thread:

"First, try closing the <meta element so it's valid XHTML."

 

It is now closed, but loading the file gives me a warning instead, forcing me to go with @$husdjur->loadHTMLFile.


 

So far so good, but that's where my luck ends... I assume that my old XPath is wrong, but I can't figure out why...

 

Warning: DOMXPath::query() [domxpath.query]: Invalid expression

 

My code, where I grab the value from every cell and insert it into a database:

 

$husdjur = new DOMDocument();
@$husdjur->loadHTMLFile("mellanlagring.html");
$xpath = new DOMXPath($husdjur);
$xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");
*snip*
$tableRows = $xpath->query('/html/body/table/tbody/tr/');
*snip*
foreach ($tableRows as $row) {
    $cells = $xpath->query('td', $row);

    foreach ($cells as $cell) {
        $cellvalue[$i] = $cell->nodeValue;
        $cellvalue[$i] = utf8_decode($cellvalue[$i]);
        $i++;
    }

    $sql = "INSERT INTO remotexdump *snip*)";
    mysql_query($sql,$con);

    $i = 0 ;
}

 

The new .html code:

 

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt } TABLE.responsedata,TABLE.responsedata TD { border: 1px solid #ccc; border-collapse: collapse; vertical-align: top; } TABLE.responsedata TD { padding-right: 0.2em; } TABLE.responsedata TH { border-bottom: 1px solid #000; }</style>
</meta>
</head>
<table class="responsedata">
<thead>
<tr>
<th>Ärendenr</th>
<th>Status</th>
<th>Ärende skapat datum</th>
<th>Skapad av</th>
<th>Ändrad</th>
<th>Ändrad av</th>
<th>Titel (*)</th>
<th>Affärssystem Id</th>
and so on...
</tr>
</thead>
<tr style="color: #f00">
<td>7968231231241</td>
<td>Påbörjad</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2001-02-18 12:09</td>
<td>Rapid2222222</td>
<td style="mso-number-format:'yyyy-mm-dd hh:mm';">2003-02-18 12:24</td>
<td>shs</td>
<td>Strömlöst i korridorerna </td>
<td>Strömlöst i korridorerna </td>
<td>12xxx4</td><td>XXXX AB - Fast</td><td>xxxx</td>
<td>Hus 02</td><td>Röntgen</td><td>Objekt</td><td>120xxxx</td>
and so on...

 

Any help is much appreciated!


First of all, the meta tag should be self-closing, and most definitely should not wrap the style tag. That means those three lines should have read like this:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt } TABLE.responsedata,TABLE.responsedata TD { border: 1px solid #ccc; border-collapse: collapse; vertical-align: top; } TABLE.responsedata TD { padding-right: 0.2em; } TABLE.responsedata TH { border-bottom: 1px solid #000; }</style>

 

You are also missing the body tag, which seems to be the principal reason for your problem.

 

PS: Always ensure that your HTML code is valid, preferably via the W3C HTML validator. Whatever you do, don't suppress error and warning messages.
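
As a rough illustration of reading the parser's complaints instead of hiding them with @, here is a minimal sketch using libxml_use_internal_errors(). The $html string is a hypothetical, cut-down stand-in for the provider's file (it repeats the meta-wraps-style mistake), and $messages is just an illustrative name:

```php
<?php
// Hypothetical markup with the same flaw as the posted source:
// <meta> is closed only after <style>, so it wraps the style tag.
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">'
      . '<style type="text/css">td { padding: 0 }</style></meta></head>'
      . '<table><tr><td>1</td></tr></table></html>';

// Collect libxml warnings in a buffer instead of printing or suppressing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Turn whatever the parser reported into readable lines.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

That way the script keeps running on sloppy markup, but you still get to see (or log) exactly what the parser objected to.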

Edited by Christian F.

Seems like they removed the tbody tag too.
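
For what it's worth, an XPath string that ends in a slash (as in '/html/body/table/tbody/tr/') is itself enough to trigger the "Invalid expression" warning, and an absolute path also breaks whenever body or tbody comes and goes. A minimal sketch of a query that does not depend on either, assuming a hypothetical two-row sample shaped like the posted table (the sample data and the $rows array are illustrative, not from the original code):

```php
<?php
// Hypothetical sample mirroring the posted structure: no <body>, no <tbody>.
$html = <<<'HTML'
<html>
<head><meta http-equiv="content-type" content="text/html;charset=utf-8" /></head>
<table class="responsedata">
<thead><tr><th>Nr</th><th>Status</th></tr></thead>
<tr><td>7968231231241</td><td>Started</td></tr>
<tr><td>7968231231242</td><td>Closed</td></tr>
</table>
</html>
HTML;

libxml_use_internal_errors(true); // buffer parser warnings rather than using @
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// No trailing slash, and no reliance on <body> or <tbody> being present.
// tr[td] keeps only rows with data cells, so the <th> header row is skipped.
$tableRows = $xpath->query('//table[@class="responsedata"]//tr[td]');

$rows = [];
foreach ($tableRows as $row) {
    $cellValues = [];
    foreach ($xpath->query('td', $row) as $cell) {
        $cellValues[] = trim($cell->nodeValue);
    }
    $rows[] = $cellValues;
}

print_r($rows);
```

Building a fresh array per row also removes the need for the manual $i counter. The registerNamespace() call is left out here on the assumption that the file is loaded with the HTML parser, which does not place elements in the XHTML namespace.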

 

Though, if you do not have permission to scrape their site, you should reconsider your course of action. While the table data itself might not be protected by copyright, they are within their rights to forbid you from automatically parsing their site.

