Extract data from html table


Mix1988

Hi, how can I extract table data from HTML?

 

I tried the simpleHTMLdom parser with no luck. Then I found this easier method, but it is not working for me:

 

<?php
$data = file_get_contents('demo.htm');

$dom = new domDocument;

@$dom->loadHTML($data);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');

$rows = $tables->item(1)->getElementsByTagName('tr');

foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    //echo $cols[2];
    print_r($cols);
}

?>

 

I get DOMNodeList Object ( [length] => 0 ), as if there is nothing there.
 
The HTML table looks like this:
<table cellpadding="0px" cellspacing="0px" style="table-layout:fixed" ;="">
<tbody><tr>
<td width="20" style="min-width:20px;max-width:20px;"></td>
<td width="100" style="min-width:100px;max-width:100px;"></td>
<td width="150" style="min-width:150px;max-width:150px;"></td>
<td width="400" style="min-width:400px;max-width:400px;"></td>
<td width="200" style="min-width:200px;max-width:200px;"></td>
</tr>
<tr>
<td rowspan="5"></td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Aeg</b><br>15.12.2010</td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Koht</b><br>Harjumaa</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 0%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> kuni 500 eurot</td>
</tr>
<tr>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> OPEL ASTRA STATION WAGON, 2006</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 100%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Käsitlev kindlustusandja:</b> QBE Insurance (Europe) Limited Eesti filiaal</td>
</tr>

<tr>
<td rowspan="5"></td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Aeg</b><br>28.08.2010</td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Koht</b><br>Tartu, Tartumaa</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> AUDI A4, 1996</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 0%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> 500 kuni 2000 eurot</td>
</tr>
<tr>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 100%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Käsitlev kindlustusandja:</b> If P&C Insurance AS</td>
</tr>

<tr>
<td></td>
<td colspan="4" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"> </td>
</tr>
</tbody></table>

 

 

What is the best way to parse this table?

 


It seems DOMDocument doesn't like the way the table is formed. To be honest, I don't think anything I have ever fed it was accepted without tweaking.

 

Here is your first error:

 

DOMDocument cannot read the file correctly. It throws errors on every line that has ;="" or ,="" in the attributes, because a bare ; or , cannot be parsed as an attribute name.

 

Even if you fix that, it still does not work and shows an empty node list. You can use

echo '<pre>' . print_r($dom,true) . '</pre>'; 

to see what the object contains.
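For reference, here is a minimal sketch of the DOMDocument route that collects the parser warnings with libxml_use_internal_errors() instead of hiding them with @. The markup is a trimmed, hypothetical copy of the table from the question, stray attributes included. It assumes the table you want is the first one in the document, so it uses item(0); the index is zero-based, and item(1) in the question points at the second table on the page. As far as I know, libxml normally recovers from the bad attribute names, so the cells should still be reachable, but treat this as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical sample: a trimmed copy of the table from the question,
// keeping the stray ;="" and ,="" attributes.
$html = <<<'HTML'
<table cellpadding="0px" style="table-layout:fixed" ;="">
<tbody><tr>
<td rowspan="2" valign="top" ,="" style="padding-left:3px;"><b>Aeg</b><br>15.12.2010</td>
<td valign="top" ,="" style="padding-left:3px;"><b>Vastutuse ulatus:</b> 0%</td>
</tr>
<tr>
<td colspan="2"><b>Koht</b><br>Harjumaa</td>
</tr>
</tbody></table>
HTML;

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$warnings = libxml_get_errors();
libxml_clear_errors();

// item(0) is the first table in the document (zero-based index).
$table = $dom->getElementsByTagName('table')->item(0);

$data = array();
foreach ($table->getElementsByTagName('tr') as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        // textContent drops the markup and keeps only the text.
        $cells[] = trim($cell->textContent);
    }
    $data[] = $cells;
}

echo count($warnings) . " parser warning(s)\n";
print_r($data);
```

Note that print_r() on a DOMNodeList shows little more than its length, so looping over the list and reading textContent tells you far more about what was actually parsed.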


 

OK, I see, so this isn't a good method after all for getting data from an HTML table. What should I use? As far as Googling goes, everybody seems to like the SimpleHtmlDom parser, but I had zero success with it...

I built my own DOM class a while ago: http://tomsfreelance.com/DOMe/DOMe.phps

 

I tested it on your table and it worked. I can't guarantee it will always work. (Requires valid HTML.) Breaking it down will be up to you though.

 

Here's an example:

 

<?php
    require_once("DOMe.php");
    
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    
    echo $dom->generate();
    
    echo "<pre>" . print_r($dom, true) . "</pre>";

You can use simpleHtmlDom.  The syntax would be:

 

 
<?php
include('path/to/simple_html_dom.php');
$dom = file_get_html('demo.htm');
 
$table = $dom->find('table',0);
 
$rows = $table->children(0)->children();
 
foreach($rows as $row) {
 foreach($row->children() as $column) {
  if(!empty($column->innertext)) {
   echo $column->innertext . '<br />' . PHP_EOL;
  }
 }
}
 
?>

I added a function "getElementsByTagName" so you can extract data more easily. Here is how you might do it:

 

<?php
    require_once("DOMe.php");
    
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    
    echo $dom->generate();
    
    $rows = $dom->getElementsByTagName("tr");
    
    $data = array();
    foreach ($rows as $row) {
        $cells = $row->getElementsByTagName("td");
        $cellData = array();
        foreach ($cells as $cell) {
            $cellData[] = $cell->generate();
        }
        $data[] = $cellData;
    }
    
    echo "<pre>" . print_r($data, true) . "</pre>";

 

Output / example is at http://tomsfreelance.com/DOMe/DOM_Import.php

Make sure you get the updated code at http://tomsfreelance.com/DOMe/DOMe.phps
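If you then want to break each cell down further, something like the sketch below might help. It assumes each cell comes back as an HTML string shaped like the cells in the original table (a bold label followed by the value); splitCell() is a hypothetical helper, not part of DOMe, and I have not checked exactly what generate() returns for a cell, so adjust the pattern to match.

```php
<?php
// Hypothetical helper: split one cell's HTML into a label and a value.
// Assumes the label is wrapped in <b>...</b> and the value follows it.
function splitCell($cellHtml)
{
    // Capture the bold label and everything after it.
    if (!preg_match('~<b>(.*?)</b>(.*)~su', $cellHtml, $m)) {
        return null; // no <b> label in this cell (e.g. an empty spacer cell)
    }
    // Strip leftover tags such as <br> and </td>, then drop the trailing colon.
    $label = rtrim(trim(strip_tags($m[1])), ':');
    $value = trim(strip_tags($m[2]));
    return array($label, $value);
}

print_r(splitCell('<td valign="top"><b>Vastutuse ulatus:</b> 100%</td>'));
print_r(splitCell('<td><b>Aeg</b><br>15.12.2010</td>'));
```

This is a plain regular expression over a single cell, not a parser, so it only holds up while the cells keep that label-then-value shape.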

This is really awesome and suits me best, thank you very much!!!
