weep Posted February 18, 2013

Hi guys,

Some time ago I asked for help with XPath and swiftly received it, thank you! Now I have almost the same problem. Here is where it all started: http://forums.phpfre...h/#entry1397191

It was working perfectly for some time, until today. It seems that the provider made a change to his source, and I cannot for the life of me find what the problem is.

First off, Maq said in the previous thread: "First, try closing the <meta element so it's valid XHTML." It is now closed and gives me a warning instead, forcing me to go with @$husdjur->loadHTMLFile. So far so good, but that's where my luck ends... I assume that my old XPath is wrong, but I can't figure out why...

    Warning: DOMXPath::query() [domxpath.query]: Invalid expression

My code, where I grab the value from every cell and poke them into a database:

    $husdjur = new DOMDocument();
    @$husdjur->loadHTMLFile("mellanlagring.html");
    $xpath = new DOMXPath($husdjur);
    $xpath->registerNamespace("xmlns", "http://www.w3.org/1999/xhtml");
    *snip*
    $tableRows = $xpath->query('/html/body/table/tbody/tr/');
    *snip*
    foreach ($tableRows as $row) {
        $cells = $xpath->query('td', $row);
        foreach ($cells as $cell) {
            $cellvalue[$i] = $cell->nodeValue;
            $cellvalue[$i] = utf8_decode($cellvalue[$i]);
            $i++;
        }
        $sql = "INSERT INTO remotexdump *snip*)";
        mysql_query($sql,$con);
        $i = 0 ;
    }

The new .html code:

    <?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt }
    TABLE.responsedata,TABLE.responsedata TD { border: 1px solid #ccc; border-collapse: collapse; vertical-align: top; }
    TABLE.responsedata TD { padding-right: 0.2em; }
    TABLE.responsedata TH { border-bottom: 1px solid #000; }</style>
    </meta>
    </head>
    <table class="responsedata">
    <thead>
    <tr>
    <th>Ärendenr</th>
    <th>Status</th>
    <th>Ärende skapat datum</th>
    <th>Skapad av</th>
    <th>Ändrad</th>
    <th>Ändrad av</th>
    <th>Titel (*)</th>
    <th>Affärssystem Id</th>
    and so on...
    </tr>
    </thead>
    <tr style="color: #f00">
    <td>7968231231241</td>
    <td>Påbörjad</td>
    <td style="mso-number-format:'yyyy-mm-dd hh:mm';">2001-02-18 12:09</td>
    <td>Rapid2222222</td>
    <td style="mso-number-format:'yyyy-mm-dd hh:mm';">2003-02-18 12:24</td>
    <td>shs</td>
    <td>Strömlöst i korridorerna </td>
    <td>Strömlöst i korridorerna </td>
    <td>12xxx4</td><td>XXXX AB - Fast</td><td>xxxx</td>
    td>Hus 02</td><td>Röntgen</td><td>Objekt</td><td>120xxxx</td>
    and so on...

Any help is much appreciated!

Link to thread: https://forums.phpfreaks.com/topic/274625-parse-html-with-xpath-part-2/
Christian F. Posted February 18, 2013 (edited)

First of all, the meta tag should be self-closing, and most definitely should not wrap the style tag. This means that those three lines should have read like this:

    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <style type="text/css">TABLE.responsedata { font-family: Calibri, Arial, monaco, monospace; font-size: 11pt }
    TABLE.responsedata,TABLE.responsedata TD { border: 1px solid #ccc; border-collapse: collapse; vertical-align: top; }
    TABLE.responsedata TD { padding-right: 0.2em; }
    TABLE.responsedata TH { border-bottom: 1px solid #000; }</style>

You are also lacking the body tag, which seems to be the principal reason for your problem.

PS: Always ensure that your HTML code is valid, preferably via the W3C HTML validator. Whatever you do, don't suppress error and warning messages.
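On the advice about not suppressing errors: a minimal sketch, assuming the markup is loaded from a string rather than from mellanlagring.html, of how libxml_use_internal_errors() can collect the parser's complaints instead of hiding them with @. The sample markup is a shortened, hypothetical stand-in for the provider's file, and which messages libxml reports may vary between versions.

```php
<?php
// Shortened, hypothetical stand-in for the provider's file (meta wraps style, no body).
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">'
      . '<style type="text/css">td { padding: 0 }</style></meta></head>'
      . '<table class="responsedata"><tr><td>1</td><td>2</td></tr></table></html>';

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Collect whatever the parser complained about, so it can be logged or inspected.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

This keeps the page free of warnings while still leaving a record of what was wrong with the input, which is useful evidence when reporting the problem to whoever generates the file.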
weep Posted February 18, 2013

Unfortunately that is what I have to work with; that is the way the file is delivered. Is there no way to work around this without manual editing?
Christian F. Posted February 18, 2013

Get in touch with the people who are responsible for generating that file, and tell them it's not valid XHTML, which is causing problems with the parsing.

In the meantime, try removing the body tag from the path. It might help, but it's a rather ugly hack.
weep Posted February 18, 2013

Hehe, I could try talking to them, but I doubt it will help (it's in their interest to prevent me from succeeding). Removing body from the path did not help...
Christian F. Posted February 18, 2013

It seems like they removed the tbody tag too.

Though, if you do not have permission to scrape their site, you should reconsider your course of action. While the table data itself might not be protected by copyright, they are within their rights to forbid you from automatically parsing their site.
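A minimal sketch of a query that does not depend on body or tbody being present, assuming the table keeps its class="responsedata" attribute and the document is parsed with the HTML loader (so element names need no namespace prefix). The sample markup is a hypothetical reduction of the file from the first post. Note also that an XPath expression cannot end in "/", so a path such as '/html/body/table/tbody/tr/' is rejected as an invalid expression whatever the document looks like.

```php
<?php
// Hypothetical reduction of the provider's file: no body, no tbody.
$html = <<<'HTML'
<html><head><title>t</title></head>
<table class="responsedata">
<thead><tr><th>Nr</th><th>Status</th></tr></thead>
<tr><td>796</td><td>Started</td></tr>
<tr><td>797</td><td>Done</td></tr>
</table></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// '//' matches at any depth, so missing body/tbody wrappers do not matter;
// the [td] predicate keeps only rows with data cells and skips the header row.
$rows = $xpath->query('//table[@class="responsedata"]//tr[td]');

$data = [];
foreach ($rows as $row) {
    $cells = [];
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim($cell->nodeValue);
    }
    $data[] = $cells;
}

print_r($data);
```

The trade-off is that a descendant query is less strict than an absolute path: it will also match a nested table with the same class, which may or may not be what is wanted.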
weep Posted February 18, 2013

It's not a problem; the data is ours to use and we pay good money for it. It's just that we want to do some parts ourselves.
weep Posted February 20, 2013

Bump: What if we treat that file as an XML file (save it as .xml and reload it)? Would that help?
weep Posted February 20, 2013

Fixed the HTML with the Tidy extension! Works perfectly now.
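The post does not show how Tidy was used, so the following is only a sketch of one way it could look: repair the markup with tidy_repair_string() and hand the result to DOMDocument. The helper name repairHtml, the chosen Tidy options, and the sample markup are all assumptions for illustration; the fallback to the raw markup when the extension is missing is there only so the sketch runs anywhere.

```php
<?php
// Hypothetical helper: repair markup with Tidy when the extension is available.
function repairHtml(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // assumption: fall back to the unrepaired markup
    }
    // Assumed options: emit XHTML and do not wrap long lines.
    $config = ['output-xhtml' => true, 'wrap' => 0];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? $html : $fixed;
}

// Hypothetical reduction of the provider's file (meta wraps style, no body).
$broken = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">'
        . '<style type="text/css">td { padding: 0 }</style></meta></head>'
        . '<table class="responsedata"><thead><tr><th>Nr</th></tr></thead>'
        . '<tr><td>796</td><td>Started</td></tr></table></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(repairHtml($broken));
libxml_clear_errors();

// Collect every data cell; header cells (th) are not matched.
$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query('//table//tr[td]/td') as $cell) {
    $values[] = trim($cell->nodeValue);
}

print_r($values);
```

Repairing first means the later XPath queries run against predictable markup, at the cost of depending on an extra extension being installed.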