Jump to content

Recommended Posts

Hi all. I am reinventing my interest in coding with a little project I have created for myself. I am now at a point where I need to reach out to this community of experts for guidance on my next move with a specific task.

I have a file that is extracted from a system in .html format and I need to write php code to extract the data from the table rows/cells so that I can populate the MySQL database rows with the data.

As I say, the file is extracted to my local machine. The structure of the .html file I have no control over so I am unable to make at adjustments to it. Annoyingly, I have just noticed that the first <tr> row of headers is built with the <td> and not the <th>. Also, there is no closing </tr> on the first row.

Some may suggest to open this in Excel and then save down as a .csv file, which is something I would prefer not to do. The less manual manipulation the better for me.

Anyway, this is the structure of the html file:

<HTML><HEAD></HEAD><BODY>
<H1>Data Report</H1><BR>
<BR>
<BR>
<P>
               Reporting Level : Admin<BR>
             Reporting Context : 2555<BR>
                  Company Name : <BR>
                 Register Type : All<BR>
                 Summary Level : Trans Level<BR>
                       Product : All<BR>
                     Date High : 25-AUG-2018<BR>
               Transfer Status : Transferred<BR>
            Transaction Number : <BR>
            Matrix Report Flag : Y<BR>
     Include Referenced Source : Y<BR>
<P>
<TABLE BORDER=1><TR><TD><B>Col_1</B></TD><TD><B>Col_2</B></TD><TD><B>Col_3</B></TD><TD><B>Col_4</B></TD><TD><B>Col_5</B></TD><TD><B>Col_6</B></TD></TD>
<TR><TD>1.01</TD><TD>1.02</TD><TD>1.03</TD><TD>myText1-1</TD><TD>myText1-2</TD><TD>1</TD></TR>
<TR><TD>2.01</TD><TD>2.02</TD><TD>2.03</TD><TD>myText2-1</TD><TD>myText2-2</TD><TD>2</TD></TR>
<TR><TD>3.01</TD><TD>3.02</TD><TD>3.03</TD><TD>myText3-1</TD><TD>myText3-2</TD><TD>3</TD></TR>
<TR><TD>4.01</TD><TD>4.02</TD><TD>4.03</TD><TD>myText4-1</TD><TD>myText4-2</TD><TD>4</TD></TR>
<TR><TD>5.01</TD><TD>5.02</TD><TD>5.03</TD><TD>myText5-1</TD><TD>myText5-2</TD><TD>5</TD></TR>
<TR><TD>6.01</TD><TD>6.02</TD><TD>6.03</TD><TD>myText6-1</TD><TD>myText6-2</TD><TD>6</TD></TR>
<TR><TD>7.01</TD><TD>7.02</TD><TD>7.03</TD><TD>myText7-1</TD><TD>myText7-2</TD><TD>7</TD></TR>
<TR><TD>8.01</TD><TD>8.02</TD><TD>8.03</TD><TD>myText8-1</TD><TD>myText8-2</TD><TD>8</TD></TR>
<TR><TD ALIGN=LEFT COLSPAN=29><B>Report Count</B></TD></TR>
<TR><TD></TD><TD></TD><TD></TD><TD></TD><TD></TD><TD>8</TD></TR>
</TABLE>
<BR>
*** End of Report ***
</BODY>
</HTML>

What I want to do of course is create a php script that will interrogate the html file and then pull the data rows into an array which can then be inserted into a pre-built MySQL table.

Appreciate all the guidance you can provide.

Thanks in advance

.....

You might want to have a look at simple_html_dom class.

This isn't whole solution but should get you started

<?php
include('simple_html_dom.php');

$html = file_get_html('sample.html');

$recs = [];
foreach ($html->find('tr') as $k=>$row) {
    foreach ($row->find('td') as $cell) {
        $recs[$k][] = $cell->plaintext;
    }
}

echo '<pre>', print_r($recs, 1), '</pre>';
?>

 

  • Like 1

Some progress made by researching file handling in php.

In terms of grabbing the file content for handling, it seems I can go with a couple of options...

<?php

$myfile = fopen("the_file.htm", "r");

while(!feof($myfile)) {
	$line = fgets($myfile);
	// do something with $line;
}

?>

or...

<?php

$myfile = file("the_file.php");

foreach ($myfile as $line_num => $line) {
    
	// do something with $line;
}

?>

So with each option, I need to be able to do a couple things.

- Firstly, identify the first line of the row of table data. As I understand it, everything above the first data row in the table is 'fixed', so this row should also start at line 18. Identifying by line number here would work.

  Alternatively, I guess I could use php to string search somehow and identify a keyword/marker of some description; perhaps the first column name in the html table, and then    tell the script to start collecting the data from all subsequent lines?

- Secondly, I need to be able to parse the data in the html table row somehow to be able to insert each column into the db table. Unfortunately there is no distinguishing character that I can 'explode' the row of data with. All I seem to have are the <TR><TD></TD></TR> tags wrapping the data, so not sure how I would handle this one.

.....

 

Well, I've been playing with the simple_html_dom tool but I'm not sure whether it is no longer supported or just well out of date, but all I get is a bucket full of errors when trying the simplest tasks.

I even downloaded the full model and ran their test scenarios without touching anything and they also failed with a bunch of errors in the included file.

Here was an example when running their scrape script for imdb. Unfortunately I'm not advanced enough to debug.

Warning: file_get_contents(): stream does not support seeking in C:\Server\data\htdocs\simplehtmldom_1_5\simple_html_dom.php on line 75

Warning: file_get_contents(): Failed to seek to position -1 in the stream in C:\Server\data\htdocs\simplehtmldom_1_5\simple_html_dom.php on line 75

Fatal error: Uncaught Error: Call to a member function find() on bool in C:\Server\data\htdocs\simplehtmldom_1_5\example\scraping\example_scraping_imdb.php:9 Stack trace: #0 
C:\Server\data\htdocs\simplehtmldom_1_5\example\scraping\example_scraping_imdb.php(47): scraping_IMDB('http://imdb.com...') #1 {main} thrown in 
C:\Server\data\htdocs\simplehtmldom_1_5\example\scraping\example_scraping_imdb.php on line 9

.....

and this is the error if I try to run basic scrape on the example html in my op, above...

Warning: preg_match_all(): Compilation failed: invalid range in character class at offset 4 in C:\Server\data\htdocs\simplehtmldom_1_5\simple_html_dom.php on line 684Warning: Invalid argument supplied for foreach() in C:\Server\data\htdocs\simplehtmldom_1_5\simple_html_dom.php on line 69

This is the basic code I used:

<?php
include_once('inc/simple_html_dom.php');

echo file_get_html('sample.htm')->plaintext;
?>

 

The code I posted worked on your sample html. The only problem was the missing "</TR>" on the header row (which I sidestepped by ignoring the the first row)

if ($k==0 continue);  // at start of loop

My output was

Array
(
    [1] => Array
        (
            [0] => 1.01
            [1] => 1.02
            [2] => 1.03
            [3] => myText1-1
            [4] => myText1-2
            [5] => 1
        )

    [2] => Array
        (
            [0] => 2.01
            [1] => 2.02
            [2] => 2.03
            [3] => myText2-1
            [4] => myText2-2
            [5] => 2
        )

    [3] => Array
        (
            [0] => 3.01
            [1] => 3.02
            [2] => 3.03
            [3] => myText3-1
            [4] => myText3-2
            [5] => 3
        )

    [4] => Array
        (
            [0] => 4.01
            [1] => 4.02
            [2] => 4.03
            [3] => myText4-1
            [4] => myText4-2
            [5] => 4
        )

    [5] => Array
        (
            [0] => 5.01
            [1] => 5.02
            [2] => 5.03
            [3] => myText5-1
            [4] => myText5-2
            [5] => 5
        )

    [6] => Array
        (
            [0] => 6.01
            [1] => 6.02
            [2] => 6.03
            [3] => myText6-1
            [4] => myText6-2
            [5] => 6
        )

    [7] => Array
        (
            [0] => 7.01
            [1] => 7.02
            [2] => 7.03
            [3] => myText7-1
            [4] => myText7-2
            [5] => 7
        )

    [8] => Array
        (
            [0] => 8.01
            [1] => 8.02
            [2] => 8.03
            [3] => myText8-1
            [4] => myText8-2
            [5] => 8
        )

    [9] => Array
        (
            [0] => Report Count
        )

    [10] => Array
        (
            [0] => 
            [1] => 
            [2] => 
            [3] => 
            [4] => 
            [5] => 8
        )

)

 

If simple_html_dom doesn't want to play, use DOMDocument instead.

$dom = new DOMDocument();
$dom->loadHTMLFile('sample.html');

foreach( $dom->getElementsByTagName('tr') as $k => $row ) {
    if ($k==0) continue;
    foreach ($row->childNodes as $cell) {
        $recs[$k][] = $cell->textContent;
    }
}

echo '<pre>', print_r($recs, 1), '</pre>';

 

Neither of them play, unfortunately. I can't get the simple_htm_dom to work, however I try, I'm wondering if it is the php version (7.3.0RC2).

The DOMDocument just throws up more errors, which I've figured out is due to the fact that the main file I will be scraping is not html perfect. It has a lot of text values that contain "&" rather than "&amp;". I did a quick replace to test the file again, but then next obstacle...

"Fatal error: Maximum execution time of 30 seconds exceeded......"

Probably worth a mention that the original file for scraping is 32,976 rows.

....

That's a little out of my scope at the moment, but I have managed to make some progress in terms of scraping the data. It's probably not the most efficient way of doing it and is quite procedural, but it's where I am at the moment.

Any feedback you are able to provide on my approach to this is greatly appreciated.

<?php

// create an array variable of the file contents
$arrMyFile = file('myDataFile.htm');

/*	Create a variable that represents the last row of the file
	containing the data data we need (3 rows up from absolute last row) */
$intLastRow = count($arrMyFile) - 3;

/* .... To Do: return total number of file rows and total number of
   .... data rows to user for file validation.
   .... Also potentially include the extraction of a $string 'marker' from the
   .... file to validate that it is the correct file e.g. a table column header.
*/

/*	Create an array of values containing the htmlentities for each table
	field we need to purge from the source file. */
/*	We need to leave in the </TD> tag which we will replace with a "," later
	but remove the end of line "</TD></TR>", as we do not want a comma there */
$arrReplaceTags = array("<TR>", "<TD>", "</TD></TR>");

//	Iterate through the file array, line by line, for processing.
foreach ($arrMyFile as $intRowNum => $strRowVal) {

	// 	We only want to process lines that are in the data table, not inc headers
	/* 	Currently assuming that the table data always starts at the same line (48)
		and continue to process up to the $lastRow row. */
	if($intRowNum > 47 && $intRowNum < $intLastRow) {
		
		/*	Convert the HTML TABLE tags on each row to htmlentities for easier parsing. */
		/*	For some reason the str_replace function does not like the < or > chars. */
		$strRowVal = htmlentities($strRowVal);
		
		// Now replace any TABLE tags in the line with the the comma (except for </TD)
		$strRowVal = str_replace($arrReplaceTags, "", $strRowVal);
		
		// Now replace the </TD> tag with the comma for clean cell separation
		$strRowVal = str_replace("</TD>", ",", $strRowVal);
		
		// Now explode the current row string into an array for populating into the DB.
		$arrRowVals = explode(',', $strRowVal);

		/*	Now we have to process the array for populating into the DB row (mixed data columns)
			which I am currently researching the best method for.
			
			Must include sanitization, even though data from source extract is considered 'safe'.
			I do not manage the data source environment so cannot always exclusively assure 
			that it has not been compromised.
			
			Also return a count of inserted rows for comparison with number of data rows in
			the original file for validation of successful import.
		*/
	}
}

?>

 

Edited by DM67
cleaned php code

Does this rather crude parser work for you? The output from your sample was

--------------------------------------------
DATA FILE: "data_10031453_1.csv"
--------------------------------------------
Col_1,Col_2,Col_3,Col_4,Col_5,Col_6
1.01,1.02,1.03,myText1-1,myText1-2,1
2.01,2.02,2.03,myText2-1,myText2-2,2
3.01,3.02,3.03,myText3-1,myText3-2,3
4.01,4.02,4.03,myText4-1,myText4-2,4
5.01,5.02,5.03,myText5-1,myText5-2,5
6.01,6.02,6.03,myText6-1,myText6-2,6
7.01,7.02,7.03,myText7-1,myText7-2,7
8.01,8.02,8.03,myText8-1,myText8-2,8
"Report Count"
,,,,,8

Code:

<?php
$html = file_get_contents('sample.html');

$hsize = strlen($html);
$tend = 0;
$tableNo = 1;
$dtm = date('mdHi'); 
while (($tstart = stripos($html, '<table', $tend)) !== false) {
    $tend = stripos($html, '</table>', $tstart);
    $records = processTable( substr($html, $tstart, $tend+8-$tstart));
    $filename = "data_{$dtm}_{$tableNo}.csv";
    
    // store each table's data in a csv file
    $fp = fopen($filename, 'w');
    foreach ($records as $rec) {
        fputcsv($fp, $rec);
    }
    fclose($fp);
    
    // display the found data from each html table
    echo "<hr>DATA FILE: \"$filename\"<hr><pre>\n";
    readfile($filename);
    echo "</pre>\n";
    
    $tableNo++;
}


function processTable($tabStr)
{
    $rows = [];
    $rowno = 0;
    $rend = 0;
    $tsize = strlen($tabStr);
    while (($rstart = stripos($tabStr, '<tr', $rend)) !== false) {
        $rend = stripos($tabStr, '<tr', $rstart+4);
        if ($rend===false) $rend = $tsize;
        $rows[$rowno++] = processRow(substr($tabStr, $rstart, $rend-$rstart));
    }
    return $rows;
}

function processRow($rowStr)
{
    $cells = [];
    $cend = 0;
    $rsize = strlen($rowStr);
    while (($cstart = stripos($rowStr, '<td', $cend)) !== false) {
        $cend = stripos($rowStr, '<td', $cstart+4);
        if ($cend===false) $cend = $rsize;
        $cells[] = trim(strip_tags(substr($rowStr, $cstart, $cend-$cstart)));
    } 
    return $cells;
}
?>

 

Thanks, @Barand

I'll have a play with that later this evening. I want to be able to try and understand what the code is doing rather than just copy/paste and be thankful.

In the meantime, the code in my post from earlier seems to be doing the trick. I have had a play with some basic sql to populate the db from the array I created. It is very basic and to me seems very inefficient to be parsing 39k+ lines of data line by line, but I am struggling to get my head round any other way.

The actual processing of each line to turn it into a 'clean', comma separated string seems fairly quick, as I can rewrite it to a .csv file in a matter of seconds.

The roadblock comes when I start to prep and load the data - line by line - into the db. I have just increased my timeout from 30 to 300 seconds and it still only gets to 6k+ rows of data before it times out.

On with the research...... I should have taken up golf!  :D

The fastest way to load data is with "LOAD DATA INFILE" statement directly loading the csv.

The next quickest is with multi-record inserts EG

INSERT INTO tableA (col1, col2, col3) VALUES ('A', 'B', 'C'), ('D' ,'E' ,'F'), ('G', 'H', 'I');

write about 1000 recs at a time.

The slowest way is to loop through data inserting a single row at a time.

I have a slightly off-topic question for you OP. A source that only has an html output doesn't sound like they intend for you to take this data for any other purpose. It sounds like you may be trying to scrape a page somewhere. Do you even have permission from the data source to do this?

15 hours ago, benanamen said:

I have a slightly off-topic question for you OP. A source that only has an html output doesn't sound like they intend for you to take this data for any other purpose. It sounds like you may be trying to scrape a page somewhere. Do you even have permission from the data source to do this?

I do indeed. The file is .htm format but is in fact the resulting extract from a system and stored locally on my machine. It seems that it is not possible to extract in .csv format (which seems odd to me) but there you go, .htm it is and .htm I have to try and work with ?

I don't actually perform the data extract, but I can tell you that it comes out of the Oracle ERP system. It would appear that when extracting as .csv format, it generally times out, but has no issue as .htm format.

Anyway. I have now managed to scrape the data, parse it and push it into the db. There is still some housekeeping to do,  but the following completes the task, start to end in about 8 seconds.

<?php

$file = file('data.htm');

$eof = count($file) - 3;

$replace = array('&lt;TR&gt;', '&lt;TD&gt;');

foreach ($file as $n => $v) {

	if ($n > 47 && $n < $eof) {
		
		$v = htmlentities($v);
		$v = str_replace($replace, '', $v);
		$v = str_replace('&lt;/TD&gt;&lt;/TR&gt;', '~', $v);
		$v = str_replace('&lt;/TD&gt;', '|', $v);
		file_put_contents('output.txt', $v, FILE_APPEND | LOCK_EX);
	}
}

include('_inc/dbConn.php');

$filea = 'output.txt';

$sql = "LOAD DATA LOCAL INFILE '$filea' INTO TABLE ftr092018 COLUMNS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '~'";

mysqli_query($dbConn, $sql) or die(mysqli_error($dbConn));
?>

 

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.