Extract data from html table


Go to solution Solved by teynon,

Hi, how can I extract table data from HTML?

 

I tried the simpleHTMLdom parser with no luck, then I found this easier method, but it is not working for me:

 

<?php
$data = file_get_contents('demo.htm');

$dom = new domDocument;

@$dom->loadHTML($data);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');

$rows = $tables->item(1)->getElementsByTagName('tr');

foreach ($rows as $row) {
        $cols = $row->getElementsByTagName('td');
        //echo $cols[2];
print_r($cols);
}

?>

 

I get DOMNodeList Object ( [length] => 0 ), as if there is nothing there.

The HTML table looks like this:
<table cellpadding="0px" cellspacing="0px" style="table-layout:fixed" ;="">
<tbody><tr>
<td width="20" style="min-width:20px;max-width:20px;"></td>
<td width="100" style="min-width:100px;max-width:100px;"></td>
<td width="150" style="min-width:150px;max-width:150px;"></td>
<td width="400" style="min-width:400px;max-width:400px;"></td>
<td width="200" style="min-width:200px;max-width:200px;"></td>
</tr>
<tr>
<td rowspan="5"></td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Aeg</b><br>15.12.2010</td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Koht</b><br>Harjumaa</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 0%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> kuni 500 eurot</td>
</tr>
<tr>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Sõiduk:</b> OPEL ASTRA STATION WAGON, 2006</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Vastutuse ulatus:</b> 100%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"><b>Käsitlev kindlustusandja:</b> QBE Insurance (Europe) Limited Eesti filiaal</td>
</tr>

<tr>
<td rowspan="5"></td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Aeg</b><br>28.08.2010</td>
<td rowspan="5" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Koht</b><br>Tartu, Tartumaa</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> AUDI A4, 1996</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 0%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> 500 kuni 2000 eurot</td>
</tr>
<tr>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Sõiduk:</b> BMW 525TDS, 1997</td>
<td valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Vastutuse ulatus:</b> 100%</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-bottom:4px;background-color:#F5F5D1;"><b>Makstud sõidukikahju hüvitis:</b> sõidukikahju ei hüvitatud</td>
</tr>
<tr>
<td colspan="2" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;background-color:#F5F5D1;"><b>Käsitlev kindlustusandja:</b> If P&C Insurance AS</td>
</tr>

<tr>
<td></td>
<td colspan="4" valign="top" ,="" style="padding-left:3px;padding-top:4px;padding-bottom:4px;border-top-style:solid;border-width:1px;border-color=#4F4F4F;"> </td>
</tr>
</tbody></table>

 

 

What is the best way to parse this table?

 


It seems DOMDocument doesn't like the way the table is formed.  To be honest, I don't think anything I have ever fed it was accepted without tweaking.

 

Here is your first error:

 

DOMDocument cannot read the file cleanly.  It throws an error on every line that has ;="" or ,="" in the attributes, because it cannot parse those as attribute names.

 

Even if you fix that, it still will not work and shows an empty node list. You can use

echo '<pre>' . print_r($dom,true) . '</pre>'; 

to see what the object contains.
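
For anyone who wants to stay with DOMDocument, here is a minimal sketch of one way to approach it, not a confirmed fix for demo.htm. It assumes a shortened stand-in for the table (the $html string is made up from the rows posted above), collects the parser errors with libxml_use_internal_errors() instead of hiding them with @, and reads the cells with DOMXPath. How many errors libxml reports for the stray attributes depends on the libxml2 version.

```php
<?php
// Shortened stand-in for the posted table (an assumption, not the real demo.htm).
// It keeps the stray ;="" and ,="" attributes that trigger the parser errors.
$html = <<<'HTML'
<table cellpadding="0px" style="table-layout:fixed" ;="">
<tbody><tr>
<td width="20"></td>
<td width="100"></td>
</tr>
<tr>
<td rowspan="5" valign="top" ,=""><b>Aeg</b><br>15.12.2010</td>
<td valign="top" ,=""><b>Koht</b><br>Harjumaa</td>
</tr>
</tbody></table>
HTML;

// Collect parser errors in a buffer instead of printing or suppressing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect what the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
echo count($errors) . " parser message(s)\n";

$xpath = new DOMXPath($dom);
$data = [];

// Walk every row of the first table in the document.
foreach ($xpath->query('(//table)[1]//tr') as $row) {
    $cells = [];
    foreach ($xpath->query('./td', $row) as $cell) {
        // Join the text nodes with a space so "<b>Aeg</b><br>15.12.2010"
        // does not collapse into one word.
        $parts = [];
        foreach ($xpath->query('.//text()', $cell) as $text) {
            $piece = trim($text->nodeValue);
            if ($piece !== '') {
                $parts[] = $piece;
            }
        }
        $cells[] = implode(' ', $parts);
    }
    $data[] = $cells;
}

print_r($data);
```

If the real file contains more than one table, the (//table)[1] part of the query is the place to pick a different one.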

OK, I see, so this isn't a good method after all for getting data from an HTML table. What should I use? As far as googling goes, everybody seems to like the SimpleHtmlDom parser, but I had zero success with it...

I built my own DOM class a while ago: http://tomsfreelance.com/DOMe/DOMe.phps

 

I tested it on your table and it worked. I can't guarantee it will always work. (Requires valid HTML.) Breaking it down will be up to you though.

 

Here's an example:

 

<?php
    require_once("DOMe.php");
    
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    
    echo $dom->generate();
    
    echo "<pre>" . print_r($dom, true) . "</pre>";

You can use simpleHtmlDom.  The syntax would be:

 

 
<?php
include('path/to/simple_html_dom.php');
$dom = file_get_html('demo.htm');
 
$table = $dom->find('table',0);
 
$rows = $table->children(0)->children();
 
foreach($rows as $row) {
 foreach($row->children() as $column) {
  if(!empty($column->innertext)) {
   echo $column->innertext . '<br />' . PHP_EOL;
  }
 }
}
 
?>
  • Solution

I added a function "getElementsByTagName" so you can extract data more easily. Here is how you might do it:

 

<?php
    require_once("DOMe.php");
    
    $dom = new DOMe("div");
    $dom->importHTML(file_get_contents("file.html"));
    
    echo $dom->generate();
    
    $rows = $dom->getElementsByTagName("tr");
    
    $data = array();
    foreach ($rows as $row) {
        $cells = $row->getElementsByTagName("td");
        $cellData = array();
        foreach ($cells as $cell) {
            $cellData[] = $cell->generate();
        }
        $data[] = $cellData;
    }
    
    echo "<pre>" . print_r($data, true) . "</pre>";

 

Output / example is at http://tomsfreelance.com/DOMe/DOM_Import.php

Make sure you get the updated code at http://tomsfreelance.com/DOMe/DOMe.phps
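
Breaking the cells down further could look something like the sketch below. It assumes each entry in $data is the cell's HTML as a string (the thread does not confirm exactly what generate() returns), and splitCell() is a hypothetical helper, not part of DOMe. It relies on the <b>label</b> value pattern used in the posted table.

```php
<?php
// Hypothetical helper, not part of DOMe: splits one cell's HTML into
// [label, value] using the <b>label</b> value pattern from the posted table.
function splitCell(string $cellHtml): ?array
{
    if (!preg_match('~<b>(.*?)</b>(.*)~s', $cellHtml, $m)) {
        return null; // spacer cells carry no label
    }
    // Drop any markup and the trailing colon from the label.
    $label = rtrim(trim(strip_tags($m[1])), ':');
    // Drop <br>, </td> and similar tags, then decode entities in the value.
    $value = trim(html_entity_decode(strip_tags($m[2]), ENT_QUOTES, 'UTF-8'));
    return [$label, $value];
}

// Assumed shape of one $data row: the HTML of each cell as a string.
$row = [
    '<td rowspan="5"></td>',
    '<td rowspan="5" valign="top"><b>Aeg</b><br>15.12.2010</td>',
    '<td rowspan="5" valign="top"><b>Koht</b><br>Harjumaa</td>',
    '<td valign="top"><b>Sõiduk:</b> BMW 525TDS, 1997</td>',
    '<td valign="top"><b>Vastutuse ulatus:</b> 0%</td>',
];

// Build a label => value record, skipping the empty spacer cells.
$record = [];
foreach ($row as $cellHtml) {
    $pair = splitCell($cellHtml);
    if ($pair !== null) {
        [$label, $value] = $pair;
        $record[$label] = $value;
    }
}

print_r($record);
```

Because of the rowspan cells, one accident spans five rows in the table, so grouping rows into accidents would still need to be done on top of this.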

Edited by teynon

This is really awesome and suits me best, thank you very much!!!
