
[SOLVED] Extracting data from HTML table


bundyxc


I have 1,000+ rows of data that are all in the exact same format:

 

      <tr>
        <td>lastName, firstName</td>

        <td>email</td>

        <td>var1</td>

        <td>var2</td>
      </tr>

 

I need to be able to extract the data from that, so that I have five variables:

 

$lastName

$firstName

$email

$var1

$var2

 

How would I go about extracting this data? Is this a regex problem, or something that could just be solved with string functions? Thanks for your time.


Time isn't a problem, so whatever's easier is better. ;)

I took a look at the page you linked to, and I don't get a thing. haha.

 

How would you do it with regular expressions? Or, if you have an example of the DOM function, it would be appreciated.


you might try something like this.

<?php
$data="
<table>
   <tr>
        <td>lastName, firstName</td>

        <td>email</td>

        <td>var1</td>

        <td>var2</td>
   </tr>
</table>";
$patterns[0] = '/<table>/';
$patterns[1] = '/</table>/';
$patterns[2] = '/</tr>/';
$patterns[3] = '/</td>/';
$replacements[0] = '';
$replacements[1] = '';
$replacements[2] = '';
$replacements[3] = '';
$data = preg_replace($patterns, $replacements, $data);
$rows = explode('<tr>', $data);
foreach($rows as $row){
list($lastName, $firstName, $email, $var1, $var2) = explode("<td>", $row);
}

?>


Just a small correction I forgot, plus an explanation.

Step 1 = set $data to your table information

Step 2 = strip <table>, </table>, </tr>, and </td> out of the $data string

Step 3 = create an array with each row of the table as an item

Step 4 = loop through the array, splitting each row on <td> and creating a variable from each cell's contents

Step 5 = split the first <td> again, because both names are held in the first <td>

 

then

<?php
//Step 1
$data="
<table>
   <tr>
        <td>lastName, firstName</td>

        <td>email</td>

        <td>var1</td>

        <td>var2</td>
   </tr>
</table>";
//Step 2
$patterns[0] = '/<table>/';
$patterns[1] = '/</table>/';
$patterns[2] = '/</tr>/';
$patterns[3] = '/</td>/';
$replacements[0] = '';
$replacements[1] = '';
$replacements[2] = '';
$replacements[3] = '';
$data = preg_replace($patterns, $replacements, $data);
//step 3
$rows = explode('<tr>', $data);
//step 4
foreach($rows as $row){
list($Name, $email, $var1, $var2) = explode("<td>", $row);
//step 5
list($lastName, $firstName) = explode(",", $Name);
}

?>


Hmm, well, since you have thousands of records I'm assuming you need to put the results into an array, as you can't store multiple names in a single variable (e.g. $lastName).

 

No offense, but I see a couple of problems with slapdashwebdesigner's code. For example, the regular expressions will fail because the forward slashes are not escaped. More importantly, the code assumes that ALL the text on the page is part of the data. After all the table tags are stripped, you would lose much of the structure.
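
To illustrate the delimiter point: in a pattern such as '/</table>/' the / inside the tag closes the pattern early, so PHP treats the rest as modifiers and typically raises an "Unknown modifier" warning. A minimal sketch of two possible fixes follows, either escaping the slash or choosing a different delimiter. The sample string is made up for the example and is not from the original posts.

```php
<?php
// Hypothetical sample input, shortened from the format in the question.
$data = "<table><tr><td>Smith, Bob</td><td>bob@smith.com</td></tr></table>";

// Option 1: keep / as the delimiter and escape the slash inside each tag.
$escaped = array('/<table>/', '/<\/table>/', '/<\/tr>/', '/<\/td>/');

// Option 2: use another delimiter (# here) so no escaping is needed.
$altDelim = array('#<table>#', '#</table>#', '#</tr>#', '#</td>#');

// A single empty string replaces every pattern in the array.
$a = preg_replace($escaped, '', $data);
$b = preg_replace($altDelim, '', $data);

var_dump($a === $b); // bool(true): both variants strip the same tags
echo $a, "\n";       // <tr><td>Smith, Bob<td>bob@smith.com
```

Since these tags are fixed strings rather than real patterns, str_replace() with an array of tags would probably work just as well and sidesteps delimiters entirely.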

 

The following code is more verbose, but has more logic in it. For example, you can set it to only look for data in a specific table - or leave as is and it will process all tables, but only the tables. The script will process an entire 'page' and put the results into a multidimensional array. Each element is a different record. See an example of the output at the end. Just add more options/conditions to the switch() as needed.

 

<?php

//Read the file as an array
$html = file('test.htm');

//Output for the results
$results = array();

//Vars for tracking the data
$inTable   = false;
$inRecord  = false;
$recordIdx = 0;
$dataIdx   = 0;

foreach($html as $line)
{
    //Determine if inside of table
    if (!$inTable)
    {
        //If looking for a SPECIFIC table, add add'l verification
        //for example, you can check table name
        if (strpos($line, '<table')!==false)
        {
            $inTable = true;
        }
    }

    if ($inTable)
    {
        //Determine if in a new row/record
        if (!$inRecord && strpos($line, '<tr')!==false)
        {
            $inRecord  = true;
        }
        //Look for a data line
        if ($inRecord &&  strpos($line, '<td')!==false)
        {
            //Assumes the whole cell sits on one line; falls back to '' if it does not
            $data = preg_match('/<td>(.*)<\/td>/', $line, $match) ? trim($match[1]) : '';
            switch($dataIdx)
            {
                case 0: //Last, First names
                    $results[$recordIdx]['lastName']  = trim(substr($data, 0, strpos($data, ',')));
                    $results[$recordIdx]['firstName'] = trim(substr($data, strpos($data, ',')+1));
                    break;
                case 1: //email
                    $results[$recordIdx]['email'] = $data;
                    break;
                case 2: //var1
                    $results[$recordIdx]['var1'] = $data;
                    break;
                case 3: //var2
                    $results[$recordIdx]['var2'] = $data;
                    break;
            }
            $dataIdx++;
        }
        //Determine if end of row/record
        if ($inRecord && strpos($line, '</tr')!==false)
        {
            $inRecord  = false;
            $recordIdx++;
            $dataIdx = 0;
        }
    }

    //Determine if end of table
    if ($inTable && strpos($line, '</table')!==false)
    {
        $inTable  = false;
    }
}

echo "<pre>";
print_r($results);
echo "</pre>";
?>

 

Example output

Array
(
    [0] => Array
        (
            [lastName] => Smith
            [firstName] => Bob
            [email] => bob@smith.com
            [var1] => male
            [var2] => 32
        )

    [1] => Array
        (
            [lastName] => jackson
            [firstName] => Michael
            [email] => michael@damato.net
            [var1] => pedo@death.com
            [var2] => 50
        )

    [2] => Array
        (
            [lastName] => Hayak
            [firstName] => Selma
            [email] => hottie@latin.com
            [var1] => female
            [var2] => 38
        )

    [3] => Array
        (
            [lastName] => Moore
            [firstName] => Demi
            [email] => demi@something.com
            [var1] => female
            [var2] => 46
        )

)
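
On restricting the script to a specific table: the comment in the code suggests checking the table name at the point where a <table line is found. A minimal sketch of such a check, assuming the target table carries a name attribute (for example <table name="data">). The helper opensNamedTable() is a made-up name for illustration, not part of the script.

```php
<?php
// Hypothetical helper: returns true when $line opens a <table> tag whose
// name attribute equals $wantedName. Assumes the attribute value is quoted.
function opensNamedTable($line, $wantedName)
{
    $pattern = '/<table\b[^>]*\bname\s*=\s*["\']'
             . preg_quote($wantedName, '/')
             . '["\']/i';
    return preg_match($pattern, $line) === 1;
}

var_dump(opensNamedTable('<table name="data">', 'data'));  // bool(true)
var_dump(opensNamedTable('<table name="other">', 'data')); // bool(false)
var_dump(opensNamedTable('<table>', 'data'));              // bool(false)
```

In the script, the condition strpos($line, '<table')!==false could then be swapped for a call such as opensNamedTable($line, 'data').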


Wow, the level of logic in there is incredible. Maybe a bit over my head. ;)

Thanks for the help though mjdamato. I haven't tested the code, as I've found a solution (OutWit Hub Addon for Firefox), but I'll definitely use this in the future. Thanks so much for your help.


Here is also an example of how to do it with DOMDocument.

 

<?php
$html = '<html>
<head>
<title>
</title>
</head>
<body>
<table name="data">
<tr>
        <td>LastName, FirstName</td>
        <td>email</td>
        <td>var1</td>
        <td>var2</td>
</tr>

<tr>
	<td>LastName_2, FirstName_2</td>
        <td>email_2</td>
        <td>var1_2</td>
        <td>var2_2</td>
</tr>
</table>
</body>
</html>';

// Create DOMDocument
$dom = new DOMDocument();

// Load html string
$dom->loadHTML($html);

// Get tables from html
$tables = $dom->getElementsByTagName('table');

// Get rows from tables
$rows = $tables->item(0)->getElementsByTagName('tr');

// Loop over each row
foreach ($rows as $row)
{
    // Get each column by tag name
    $cols = $row->getElementsByTagName('td');

    // Echo values (here you could assign them to an array, for example)
    echo $cols->item(0)->nodeValue.'<br />';
    echo $cols->item(1)->nodeValue.'<br />';
    echo $cols->item(2)->nodeValue.'<br />';
    echo $cols->item(3)->nodeValue;
    echo '<hr />';
}
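
As the comment in the loop hints, the values can be collected into an array instead of echoed. A minimal sketch of that idea, which also splits the first cell into last and first name as the original question asked. The sample rows and the array keys are assumptions for the example, and unlike the code above it walks every <tr> in the document rather than only the first table.

```php
<?php
// Hypothetical sample markup; the rows follow the format from the question.
$html = '<html><body><table>
<tr><td>Smith, Bob</td><td>bob@smith.com</td><td>male</td><td>32</td></tr>
<tr><td>Moore, Demi</td><td>demi@something.com</td><td>female</td><td>46</td></tr>
</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$results = array();
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');

    // Skip header rows or rows without the four expected cells.
    if ($cols->length < 4) {
        continue;
    }

    // Split "lastName, firstName" on the first comma only.
    $name = explode(',', $cols->item(0)->nodeValue, 2);

    $results[] = array(
        'lastName'  => trim($name[0]),
        'firstName' => isset($name[1]) ? trim($name[1]) : '',
        'email'     => trim($cols->item(1)->nodeValue),
        'var1'      => trim($cols->item(2)->nodeValue),
        'var2'      => trim($cols->item(3)->nodeValue),
    );
}

print_r($results);
```

Each element of $results then holds one record, in the same shape as the array output shown earlier in the thread.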
