Archived

This topic is now archived and is closed to further replies.

andreastein

Parsing very large xml file


Hi,

I need to load and parse a very large XML file (over 5 million lines) daily, then dump its contents into a database. I increased max_execution_time to 5 minutes, but the script seems to stop after about 30 seconds. No error is printed, and it stops in a different place each time. I print the setting from within my script using ini_get() to make sure it really is 300, and it is. Any other reason this might happen, or tips for reading such a huge file? I tried using two scripts and passing the last byte read between them using fseek(), but because I'm passing a URL to fopen(), it looks like I'm out of luck there.

Any help much much appreciated!

Thx.
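For a file that size, a streaming parser is usually the way to go: XMLReader walks the document node by node instead of loading all 5 million lines into memory at once. A minimal sketch only; the element name 'record' and the feed URL below are placeholders, not details from the post:

```php
<?php
// Sketch: stream-parse a huge XML feed with XMLReader so the whole
// file never sits in memory. 'record' and the URL are placeholders.
$reader = new XMLReader();
if (!$reader->open('http://example.com/feed.xml')) {
    die('Could not open the XML source');
}
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // Expand just this one element for easy access, then insert it
        // into the database before moving on to the next one.
        $node = $reader->expand();
        // ... build and run your INSERT here ...
    }
}
$reader->close();
```

If the 30-second cutoff persists even with max_execution_time at 300, it may be the web server (not PHP) timing out, so running a job like this from the command line or cron is worth considering.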

Hey Andy! What in the world are you doing that needs 5 million lines processed daily? I don't know if anyone here has experience with that. I'm a beginner, but what I would do in your place is store the last successful line read into the database, then have the page refresh itself so it keeps working through your entire file until it's finished. I had the same problem with something a good 1,000 times smaller than what you're working on, lol.

OK, bear with me; this might look confusing, but it really isn't.

// This section should be at the very top of your script.
$number_per_load = 100;
if (!isset($_GET['start'])) {
    $_GET['start'] = 0;
}
if (!isset($_GET['redirect_number'])) {
    $_GET['redirect_number'] = 0;
}
// END TOP SECTION

Then, in the $sql statement that queries the database for those millions of rows, you need to append LIMIT ".$_GET['start'].", $number_per_load at the very end. $number_per_load should be the last thing in your $sql statement.
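Put together, the finished statement might look like this (the table name `xml_rows` and column `id` are made up for illustration):

```php
<?php
// Hypothetical example of the paginated query described above;
// `xml_rows` and `id` are placeholder names, not from the post.
$number_per_load = 100;
$start = isset($_GET['start']) ? (int)$_GET['start'] : 0;
$sql = "SELECT * FROM `xml_rows` ORDER BY `id` LIMIT " . $start . ", $number_per_load";
```

Casting $start to int also keeps a user-supplied query-string value from breaking the SQL.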

Then set $i = 0; one line above your while loop, and run the loop. Inside the loop, after all the code you use to parse, include this next bit of code.

You'll have to take the row you last processed successfully and store it in a temporary table in your database, like so:

// Record the last row we handled successfully.
$sql_3 = "INSERT INTO `tmp_mail_count` VALUES ('$user_id')";
$result_3 = mysql_query($sql_3);

$i++;

if ($i == $number_per_load) {
    $start = $_GET['start'] + $number_per_load;
    $_GET['redirect_number']++;
    if ($_GET['redirect_number'] > 10) {
        // Drop the counter from the URL so browsers don't abort
        // after too many consecutive redirects.
        header("Location: your_script_name_here.php?start=$start");
    }
    else {
        header("Location: $_SERVER[PHP_SELF]?start=$start&redirect_number=$_GET[redirect_number]");
    }
    exit;
}

Did you follow all of that? What this basically does is run through your entire SQL query without the page stalling, because it refreshes every time it hits the limit. You can of course raise $number_per_load above 100; I used that value because of the particular query I was running. Anyway, let me know how it works for you! And please do tell me how much money you're making: 5 million lines a day should be a good $50K a month!
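On the original fseek()-on-a-URL problem: seeking generally doesn't work on HTTP streams, but one possible workaround (just a sketch; the URL, temp path, and $last_byte_read are placeholders) is to mirror the feed to a local file first and then seek in that copy between runs:

```php
<?php
// Sketch: fseek() is unreliable on remote streams, so copy the feed
// to a local file, then resume by byte offset on later runs.
// The URL, temp path, and $last_byte_read are placeholders.
$src = fopen('http://example.com/feed.xml', 'rb');
$dst = fopen('/tmp/feed.xml', 'wb');
stream_copy_to_stream($src, $dst);
fclose($src);
fclose($dst);

$last_byte_read = 0;          // load this from wherever the last run saved it
$fh = fopen('/tmp/feed.xml', 'rb');
fseek($fh, $last_byte_read);  // jump straight to where we left off
// ... parse from here, saving ftell($fh) before the script exits ...
fclose($fh);
```

The trade-off is one full download per day, but after that every resumed run works against a local file where fseek() behaves normally.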

QUOTE (andyj @ Apr 26 2006, 05:32 PM)

