Dear All,
I have been trying to speed up a script I wrote that takes the content of a web page (it's an SAP BI Web Query, if anyone's wondering), parses it, reads the contents of one particular table and inserts them into a MySQL database table.
Now, before anyone asks, the problem has nothing to do with MySQL.
What I have found to be a performance problem is this bit of code :-
// timing start; assumed to be here, since the original snippet used
// $start_time without showing where it was set
$start_time = microtime(true);
$doc = new DOMDocument();
@$doc->loadHTMLFile("/tmp/test.html");
$tables = $doc->getElementsByTagName('table');
for ($i = 0; $i < $tables->length; ++$i) {
    if ($tables->item($i)->getAttribute('name') == "GR1Table")
    {
        $table = $tables->item($i);
        // microtime(true) returns the current time as a float, in seconds
        $find_all_tables_time = microtime(true);
        break;
    }
}
// get all rows from the GR1Table table
$rows = $table->getElementsByTagName('tr');
set_time_limit(1200);
$rowcount = $rows->length;
$insert_limit = 1000;
// iterate over all but the first row
for ($i = 1; $i <= $insert_limit; ++$i)
{
    $row = $rows->item($i)->textContent;
    if ($i == $insert_limit)
    {
        // report the elapsed time at each checkpoint
        $loop_1 = microtime(true);
        echo "Rows = " . $insert_limit . "\tTime:\t " . round($loop_1 - $start_time, 4) . "<br/>";
        if ($insert_limit < $rowcount)
        {
            $insert_limit += 500;
        }
    }
}
I'll try to explain as best I can. The HTML document contains 8163 rows in the table I am parsing. Using PHP just to loop over the rows and assign each row's textContent to a variable becomes slower the more rows you process. Here is the resulting output of the above code.
Rows = 1000 Time: 2.1187
Rows = 1500 Time: 3.5387
Rows = 2000 Time: 5.7027
Rows = 2500 Time: 8.5453
Rows = 3000 Time: 12.0353
Rows = 3500 Time: 16.1604
Rows = 4000 Time: 20.917
Rows = 4500 Time: 26.3189
Rows = 5000 Time: 32.3477
Rows = 5500 Time: 39.0064
Rows = 6000 Time: 46.2941
Rows = 6500 Time: 54.2183
Rows = 7000 Time: 62.8111
Rows = 7500 Time: 72.1436
Rows = 8000 Time: 81.9869
Rows = 8500 Time: 92.3004
Now that's slow; the times are in seconds!
Here's the same code written in JavaScript, together with its results :-
function rowloop()
{
    var i, mytab; // declared locally so they don't leak into the global scope
    var mytabs = document.getElementsByTagName("TABLE");
    for (i = 0; i < mytabs.length; ++i)
    {
        if (mytabs[i].getAttribute("name") == "GR1Table")
        {
            mytab = mytabs[i];
            break;
        }
    }
    var myrows = mytab.getElementsByTagName("TR");
    var rowcount = myrows.length;
    var insert_limit = 1000;
    var startTime = new Date();
    for (i = 1; i <= insert_limit; ++i)
    {
        var myrow = myrows[i].textContent;
        if (i == insert_limit)
        {
            // report the elapsed time at each checkpoint
            var endTime = new Date();
            var totalTime = endTime - startTime;
            document.write("Rows = " + insert_limit + " Time: " + totalTime + "ms<br/>");
            if (insert_limit < rowcount)
            {
                insert_limit += 500;
            }
        }
    }
}
And here are the results for the JavaScript code :-
Rows = 1000 Time: 16ms
Rows = 1500 Time: 643ms
Rows = 2000 Time: 650ms
Rows = 2500 Time: 658ms
Rows = 3000 Time: 666ms
Rows = 3500 Time: 674ms
Rows = 4000 Time: 682ms
Rows = 4500 Time: 690ms
Rows = 5000 Time: 698ms
Rows = 5500 Time: 706ms
Rows = 6000 Time: 789ms
Rows = 6500 Time: 797ms
Rows = 7000 Time: 805ms
Rows = 7500 Time: 813ms
Rows = 8000 Time: 821ms
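To make the difference between the two sets of timings easier to see, here is a rough sketch that turns the cumulative figures above into the time spent on each 500-row step. The numbers are copied from the two listings; batchDeltas is just a helper name made up for this illustration, and the script is meant to be run on its own (for example with Node.js), not inside the page.

```javascript
// Cumulative timings copied from the two result listings above.
// PHP times are in seconds, JavaScript times are in milliseconds.
const phpSeconds = [2.1187, 3.5387, 5.7027, 8.5453, 12.0353, 16.1604, 20.917,
                    26.3189, 32.3477, 39.0064, 46.2941, 54.2183, 62.8111,
                    72.1436, 81.9869, 92.3004];
const jsMillis = [16, 643, 650, 658, 666, 674, 682, 690, 698, 706, 789, 797,
                  805, 813, 821];

// batchDeltas (hypothetical helper): the difference between consecutive
// cumulative readings, i.e. the time taken by each 500-row step.
function batchDeltas(cumulative) {
    const deltas = [];
    for (let i = 1; i < cumulative.length; ++i) {
        deltas.push(cumulative[i] - cumulative[i - 1]);
    }
    return deltas;
}

// Convert the PHP steps to whole milliseconds so both lists use the same unit.
const phpDeltas = batchDeltas(phpSeconds).map(s => Math.round(s * 1000));
const jsDeltas = batchDeltas(jsMillis);

console.log("PHP ms per 500-row step: " + phpDeltas.join(", "));
console.log("JS  ms per 500-row step: " + jsDeltas.join(", "));
```

Going by the figures above, each PHP step takes longer than the one before it (from roughly 1.4 seconds for rows 1000 to 1500 up to roughly 10 seconds for the last step), whereas the JavaScript steps stay at 7 or 8 ms apart from two jumps of 627 ms and 83 ms. So the cost per row in PHP seems to grow with the row index, while in JavaScript it stays more or less flat.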
Can anyone explain this for me? It's driving me up the wall.
I am running PHP 5.2.5 but have also tried 5.2.4, with no difference in the results.
Thanks in advance.
Craig