Jump to content

Recommended Posts

Hi,

So i have HTML table which i'm attempting to turn into executable SQL code which i can then insert into a SQL database table.

I'm wondering if anyone has any good ideas for the best way to get this done?

I need to extract the data from each of the <tr bgcolor="#EEEEEE">, scrape the text within each <td></td> and then format it in sql.

so far i have removed the html not required and created an array by exploding "<tr bgcolor="#EEEEEE">", however, its creating an empty element in the first array index [0], to fix this i have tried $players = array_shift($players); however it removed everything from the array.

next i was thinking i would need to remove all html within each array element and add a comma to the end, then i would need to compress it all back together again?

would this be the best way to do this?

source html

<html>
<head>
<title><br/></title> 
<style type ="text/css">
body,td,th {
font-family: Verdana, Arial, Helvetica, sans-serif;
font-size: 12px;
}
th {
padding: 5px;
text-align: left;
}
td {
padding: 2px;
border-style: solid;
border-width: 1px
}
</style>
</head>
<body>
<center><B><br/>
</center></B>
<p>
<table bordercolor="#000000" width="90%" align="center">
<tr bgcolor="#EEEEEE">
	<th>UID</th>

	<th>Name</th>

	<th>Nat</th>

	<th>Position</th>

	<th>Best Pos</th>

	<th>Age</th>

	<th>DoB</th>

	<th>Height</th>

	<th>Weight</th>

	<th>Expires</th>

	<th>Wage</th>

	<th>Value</th>

	<th>Aer</th>

	<th>Cmd</th>

	<th>Com</th>

	<th>Ecc</th>

	<th>Han</th>

	<th>Kic</th>

	<th>1v1</th>

	<th>Pun</th>

	<th>Ref</th>

	<th>TRO</th>

	<th>Thr</th>

	<th>Cor</th>

	<th>Cro</th>

	<th>Dri</th>

	<th>Fin</th>

	<th>Fir</th>

	<th>Fre</th>

	<th>Hea</th>

	<th>Lon</th>

	<th>L Th</th>

	<th>Mar</th>

	<th>Pas</th>

	<th>Pen</th>

	<th>Tck</th>

	<th>Tec</th>

	<th>Agg</th>

	<th>Ant</th>

	<th>Bra</th>

	<th>Cmp</th>

	<th>Cnt</th>

	<th>Dec</th>

	<th>Det</th>

	<th>Fla</th>

	<th>Ldr</th>

	<th>OtB</th>

	<th>Pos</th>

	<th>Tea</th>

	<th>Vis</th>

	<th>Wor</th>

	<th>Acc</th>

	<th>Agi</th>

	<th>Bal</th>

	<th>Jum</th>

	<th>Nat</th>

	<th>Pac</th>

	<th>Sta</th>

	<th>Str</th>

</tr>
<tr bgcolor="#EEEEEE">
	<td>859596</td>

	<td>Tim Krul</td>

	<td>NED</td>

	<td>GK</td>

	<td>GK</td>

	<td>31</td>

	<td>3/4/1988 (31 years old)</td>

	<td>194 cm</td>

	<td>83 kg</td>

	<td>30/6/2022</td>

	<td>£25,000 p/w</td>

	<td>£12.75M</td>

	<td>15</td>

	<td>13</td>

	<td>14</td>

	<td>8</td>

	<td>13</td>

	<td>13</td>

	<td>15</td>

	<td>12</td>

	<td>16</td>

	<td>15</td>

	<td>12</td>

	<td>1</td>

	<td>2</td>

	<td>1</td>

	<td>3</td>

	<td>8</td>

	<td>4</td>

	<td>5</td>

	<td>3</td>

	<td>2</td>

	<td>2</td>

	<td>9</td>

	<td>4</td>

	<td>3</td>

	<td>8</td>

	<td>11</td>

	<td>12</td>

	<td>14</td>

	<td>13</td>

	<td>13</td>

	<td>12</td>

	<td>14</td>

	<td>2</td>

	<td>14</td>

	<td>2</td>

	<td>15</td>

	<td>10</td>

	<td>5</td>

	<td>13</td>

	<td>11</td>

	<td>13</td>

	<td>14</td>

	<td>17</td>

	<td>13</td>

	<td>8</td>

	<td>14</td>

	<td>10</td>

</tr>
<tr bgcolor="#EEEEEE">
	<td>5204730</td>

	<td>Michael McGovern</td>

	<td>NIR</td>

	<td>GK</td>

	<td>GK</td>

	<td>34</td>

	<td>12/7/1984 (34 years old)</td>

	<td>188 cm</td>

	<td>86 kg</td>

	<td>30/6/2021</td>

	<td>£8,000 p/w</td>

	<td>£875K</td>

	<td>12</td>

	<td>11</td>

	<td>10</td>

	<td>2</td>

	<td>12</td>

	<td>12</td>

	<td>12</td>

	<td>11</td>

	<td>13</td>

	<td>8</td>

	<td>10</td>

	<td>5</td>

	<td>2</td>

	<td>1</td>

	<td>3</td>

	<td>9</td>

	<td>7</td>

	<td>6</td>

	<td>3</td>

	<td>3</td>

	<td>2</td>

	<td>7</td>

	<td>1</td>

	<td>2</td>

	<td>5</td>

	<td>7</td>

	<td>12</td>

	<td>14</td>

	<td>13</td>

	<td>13</td>

	<td>11</td>

	<td>17</td>

	<td>1</td>

	<td>12</td>

	<td>7</td>

	<td>12</td>

	<td>15</td>

	<td>10</td>

	<td>15</td>

	<td>10</td>

	<td>13</td>

	<td>12</td>

	<td>12</td>

	<td>12</td>

	<td>8</td>

	<td>15</td>

	<td>14</td>

</tr>
</table>
<p>
<p align="center">
</body>
</html>

 

code i've created so far:

<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end)
{
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
    return $data;   // Returning the scraped data from the function
}

$myfile = fopen("source.html", "r") or die("Unable to open file!");
$page = fread($myfile,filesize("source.html"));

$page = scrape_between($page, '</tr>', '</table>');

$players = explode('<tr bgcolor="#EEEEEE">', $page);

echo '<pre>'; print_r($players); echo '</pre>';

fclose($myfile);
?>

 

Edited by seany1234

Try using domDocument or simple_html_dom.

EG

include("/path/to/simple_html_dom.php");

$html = file_get_html('my_source.html');

$rows = $html->find('tr[bgcolor="#EEEEEE"]');

$csvFile = fopen('mydata.csv', 'w');
foreach ($rows as $row) {
    $data = [];
    foreach ($row->children() as $itm) {
        $data[] = $itm->plaintext;
    }
    fputcsv($csvFile, $data);
}
fclose($csvFile);

The csv file can then be loaded into a sql table using a LOAD DATA INFILE statement.

Edited by Barand
  • Like 1

P.S.

Some of the data is unsuitable for data storage purposes and needs to be reformatted before storing in a database table

+------------+--------------------------------+----------------+----------------------------------+
| Column     |        WRONG                   |   CORRECT      |  Comments                        |
+------------+--------------------------------+----------------+----------------------------------+
| Age        | 31                             |  Not required  |  DoB tells you the age           |
| DoB        | 3/4/1988 (31 years old)        |  1988-04-03    |  Store in correct date format    |
| Height     | 194 cm                         |  194           |  Store unformatted numeric value |
| Weight     | 83 kg                          |  83            |  Store unformatted numeric value |
| Expires    | 30/6/2022                      |  2022-06-30    |  Store in correct date format    |
| Wage       | £25,000 p/w                    |  25000         |  Store unformatted numeric value |
| Value      | £12.75M                        |  12750000      |  Store unformatted numeric value |
+------------+--------------------------------+----------------+----------------------------------+

 

26 minutes ago, seany1234 said:

how would it be possible to format data

You would do the required formatting before writing to the csv file.

 

26 minutes ago, seany1234 said:

remove the first row?

LOAD DATA INFILE command has a "IGNORE N rows" option for ignoring headings.

If you don't want to use that LOAD DATA option (which is certainly the fastest and will load hundreds of records per second) then , with a PDO connection and a prepared statement you could change the loop to

$stmt = $conn->prepare("INSERT INTO thetable (uid, ... , etc)" VALUES(?,?,?, ...., etc));

foreach ($rows as $row) {
    $data = [];
    foreach ($row->children() as $itm) {
        $val = $itm->plaintext;
// do any required reformatting here
        $data[] = $val;
    }
    $stmt->execute($data);
}

 

Edited by Barand

Thanks for your reply, i think i will go with the second option, these html tables will only be 50 rows max, so speed isn't particularly required.

i'm getting a little confused now as i haven't used ->children or ->plaintext before.

how can i do the formatting on a specific data field?

i was originally thinking i could do something like

$val[1] = formatted text;

but that wouldn't work.

Something along these lines...

include("path/to/simple_html_dom.php");

$html = file_get_html('mysource.html');

$rows = $html->find('tr[bgcolor="#EEEEEE"]');

$stmt = $conn->prepare("INSERT INTO thetable (uid, ... , etc)" VALUES(?,?,?, ...., etc));

foreach ($rows as $row) {
    if (strpos($row->children(0), '<th>')!==false) continue;
    $data = [];
    foreach ($row->children() as $k => $itm) {
        $val = $itm->plaintext;
        switch ($k) {
            case 6:                                 // DoB
            case 9:                                 // Expires
                $data[] = reformatDate($val);
                break;
            case 7:                                 // Height
            case 8:                                 // Weight
            case 10:                                // Wage
                $data[] = numOnly($val);
                break;
            case 11:                                // Value
                $data[] = reformatValue($val); 
                break;    
            default: 
                $data[] = $val;
        }
    }
    $stmt->execute($data);
    
}

function reformatDate($str)
{
    list($dob) = explode(' ', $str);
    $res = (DateTime::createFromFormat('d/m/Y', $dob))->format('Y-m-d') ;
    return $res;
}

function numOnly($str)
{
    $res = '';
    for($i=0, $k=strlen($str); $i<$k; $i++) {
        $res .= ctype_digit($str[$i]) ? $str[$i] : '';
    }
    return $res;
}

function reformatValue($str)
{
    //
    // This is left as an exercise for the reader!
    //
    return $str;      // leave unchanged for now
}

 

Edited by Barand
This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.