Mix1988 Posted February 28, 2013

Hi, how can I extract table data from HTML? I tried the simpleHTMLdom parser with no luck, then I found this easier method, but it is not working for me:

    <?php
    $data = file_get_contents('demo.htm');
    $dom = new domDocument;
    @$dom->loadHTML($data);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $rows = $tables->item(1)->getElementsByTagName('tr');
    foreach ($rows as $row) {
        $cols = $row->getElementsByTagName('td');
        //echo $cols[2];
        print_r($cols);
    }
    ?>

I get DOMNodeList Object ( [length] => 0 ), as if there is nothing there. The HTML table looks like this:

    <table cellpadding="0px" cellspacing="0px" style="table-layout:fixed" ;="">
    <tbody><tr>
    <td width="20" style="min-width:20px;max-width:20px;"></td>
    <td width="100" style="min-width:100px;max-width:100px;"></td>
    <td width="150" style="min-width:150px;max-width:150px;"></td>
    <td width="400" style="min-width:400px;max-width:400px;"></td>
    <td width="200" style="min-width:200px;max-width:200px;"></td>
    </tr>
    <tr>
    <td rowspan="5"></td>
    <td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Aeg</b><br>15.12.2010</td>
    <td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Koht</b><br>Harjumaa</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 0%</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> kuni 500 eurot</td>
    </tr>
    <tr>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> OPEL ASTRA STATION WAGON, 2006</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 100%</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Käsitlev kindlustusandja:</b> QBE Insurance (Europe) Limited Eesti filiaal</td>
    </tr>
    <tr>
    <td rowspan="5"></td>
    <td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Aeg</b><br>28.08.2010</td>
    <td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Koht</b><br>Tartu, Tartumaa</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> AUDI A4, 1996</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 0%</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> 500 kuni 2000 eurot</td>
    </tr>
    <tr>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
    <td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 100%</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
    </tr>
    <tr>
    <td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Käsitlev kindlustusandja:</b> If P&C Insurance AS</td>
    </tr>
    <tr>
    <td></td>
    <td colspan="4" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"> </td>
    </tr>
    </tbody></table>

What is the best way to parse this table?
jcbones Posted February 28, 2013

It seems domDocument() doesn't like the way the table is formed. To be honest, I don't think anything I have ever fed it was acceptable without tweaking.

Here is your first error: DomDocument cannot read the file correctly. It throws errors on every line that has ;="" in the attributes, because it cannot work out the attribute name. Even if you fix that, it still does not work and shows an empty nodeList.

You can use

    echo '<pre>' . print_r($dom,true) . '</pre>';

to see what the object contains.
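One more thing that may be worth checking in the original snippet: item() on a DOMNodeList is zero-based, so $tables->item(1) asks for the second table in the file, not the first. The following is only a minimal sketch, assuming the wanted table is the first one in the document. The sample markup is a shortened, ASCII-only stand-in for the table in the question (stray ;="" and ,="" attributes included), and libxml_use_internal_errors() is used in place of the @ operator so parser warnings are collected quietly.

```php
<?php
// Shortened sample modeled on the table in the question (an assumption,
// not the real demo.htm). It keeps the stray ;="" and ,="" attributes.
$html = <<<'HTML'
<table cellpadding="0px" style="table-layout:fixed" ;="">
<tbody>
<tr><td width="20"></td><td width="100"></td></tr>
<tr><td valign="top" ,=""><b>Aeg</b><br>15.12.2010</td><td valign="top" ,=""><b>Koht</b><br>Harjumaa</td></tr>
</tbody>
</table>
HTML;

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// item() is zero-based: item(0) is the first table, item(1) a second one.
$table = $dom->getElementsByTagName('table')->item(0);

$data = [];
if ($table !== null) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->getElementsByTagName('td') as $cell) {
            // textContent concatenates the text of all child nodes.
            $cells[] = trim($cell->textContent);
        }
        $data[] = $cells;
    }
}

print_r($data);
```

If the real file contains non-ASCII text such as "Sõiduk", the encoding may also need attention, since loadHTML() does not assume UTF-8 unless the document declares it.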
Mix1988 Posted February 28, 2013 (Author)

OK, I see. So this isn't a good method after all for getting data from an HTML table. What should I use? As far as googling goes, everybody seems to like the SimpleHtmlDom parser, but I had zero success with it...
teynon Posted March 1, 2013

I built my own DOM class a while ago: http://tomsfreelance.com/DOMe/DOMe.phps

I tested it on your table and it worked. I can't guarantee it will always work. (It requires valid HTML.) Breaking the result down will be up to you, though. Here's an example:

    <?php
    require_once("DOMe.php");
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    echo $dom->generate();
    echo "<pre>" . print_r($dom, true) . "</pre>";
jcbones Posted March 1, 2013

You can use simpleHtmlDom. The syntax would be:

    <?php
    include('path/to/simple_html_dom.php');
    $dom = file_get_html('demo.htm');
    $table = $dom->find('table',0);
    $rows = $table->children(0)->children();
    foreach($rows as $row) {
        foreach($row->children() as $column) {
            if(!empty($column->innertext)) {
                echo $column->innertext . '<br />' . PHP_EOL;
            }
        }
    }
    ?>
teynon Posted March 1, 2013

In fact, here is the example I posted using your table. http://tomsfreelance.com/DOMe/DOM_Import.php
Solution: teynon Posted March 1, 2013 (edited March 1, 2013 by teynon)

I added a function "getElementsByTagName" so you can extract data more easily. Here is how you might do it:

    <?php
    require_once("DOMe.php");
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    echo $dom->generate();
    $rows = $dom->getElementsByTagName("tr");
    $data = array();
    foreach ($rows as $row) {
        $cells = $row->getElementsByTagName("td");
        $cellData = array();
        foreach ($cells as $cell) {
            $cellData[] = $cell->generate();
        }
        $data[] = $cellData;
    }
    echo "<pre>" . print_r($data, true) . "</pre>";

Output / example is at http://tomsfreelance.com/DOMe/DOM_Import.php

Make sure you get the updated code at http://tomsfreelance.com/DOMe/DOMe.phps
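If plain text is wanted instead of markup, the collected cells could be post-processed with PHP's built-in string functions. This is only a sketch: it assumes each entry in $data is a string of cell HTML, which is a guess at what generate() returns, and the sample array below is hypothetical stand-in data, not actual DOMe output.

```php
<?php
// Hypothetical stand-in for the $data array built above: rows of cell HTML.
$data = [
    ['<td><b>Aeg</b><br>15.12.2010</td>', '<td><b>Koht</b><br>Harjumaa</td>'],
    ['<td></td>'],
];

$text = [];
foreach ($data as $row) {
    $cells = [];
    foreach ($row as $cellHtml) {
        // Replace <br> with a space so the label and value do not run together.
        $spaced = preg_replace('/<br\s*\/?>/i', ' ', $cellHtml);
        // Drop the remaining tags and decode entities such as &amp;.
        $plain = trim(html_entity_decode(strip_tags($spaced), ENT_QUOTES, 'UTF-8'));
        if ($plain !== '') {
            $cells[] = $plain;
        }
    }
    // Skip rows that contain only empty cells.
    if ($cells) {
        $text[] = $cells;
    }
}

print_r($text);
```

Empty cells and empty rows are skipped here because the table in the question uses them only for layout.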
Mix1988 Posted March 1, 2013 (Author)

This is really awesome and suits me best, thank you very much!!!