Extract large XML feed into MySQL database


ravi_sdl

Hi,

 

I want to insert data from a large XML feed into MySQL. The feed is 450 MB, and it takes approximately 4-5 hours to insert all the records into the database. I have written code, but it is not working properly: it hits a maximum execution time error and inserts only 20-25 thousand records into the MySQL table. I have already set max_execution_time with ini_set, but it is not working.

 

Can anyone suggest any tips? Please note that the script will take 4-5 hours to execute. Is there any way to run the PHP code in the background, or any setting in MySQL that would help? Please advise me on this issue.

 

Regards,

Ravi

Do not run it all at once; instead, run it in parts. For example, process 10K records at a time, then write to the database which record was processed last. On the next run, the script should check the DB to see where to start, and then process the next 10K records. We do this kind of thing regularly in PHP with large data sets. I hope you get the idea.
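
A minimal sketch of that checkpoint approach, assuming a hypothetical items(sku, name, price) table, a feed_progress bookkeeping table, and a feed.xml file shaped like <items><item>...</item></items> - all of those names, columns, and credentials are illustrative, not taken from the original post. Each run processes one batch and records where it stopped, so a cron job can simply re-run the script until the feed is done:

<?php
// Checkpoint/resume sketch: process one batch of records per run, remember the
// last record handled, and pick up from there on the next run.
// Assumed (illustrative) schema:
//   CREATE TABLE items (sku VARCHAR(32), name VARCHAR(255), price DECIMAL(10,2));
//   CREATE TABLE feed_progress (id INT PRIMARY KEY, last_offset INT NOT NULL);
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8mb4', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$batchSize = 10000;

// Where did the previous run stop? (0 if this is the first run.)
$lastOffset = (int) $pdo->query('SELECT last_offset FROM feed_progress WHERE id = 1')
                        ->fetchColumn();

$insert = $pdo->prepare('INSERT INTO items (sku, name, price) VALUES (?, ?, ?)');

$reader = new XMLReader();                      // stream the 450 MB file, never load it whole
$reader->open('feed.xml');
while ($reader->read() && $reader->name !== 'item');   // advance to the first <item>

$offset = 0;
$processed = 0;
while ($reader->name === 'item') {
    $offset++;
    if ($offset > $lastOffset) {                // skip records handled by earlier runs
        $item = new SimpleXMLElement($reader->readOuterXML());
        $insert->execute([(string) $item->sku, (string) $item->name, (float) $item->price]);
        if (++$processed >= $batchSize) {       // one batch per run; cron triggers the next run
            break;
        }
    }
    $reader->next('item');                      // jump to the next sibling <item>
}
$reader->close();

// Record how far this run got so the next run knows where to resume.
$pdo->prepare('REPLACE INTO feed_progress (id, last_offset) VALUES (1, ?)')
    ->execute([$lastOffset + $processed]);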

Also, it is very inefficient to run any kind of query inside a loop. The time it takes PHP to send the query statement/data (when using prepared statements) to the database server is usually longer than the time it actually takes the database server to insert the data into the table.

 

Forming a multi-value insert query, where you insert as many rows of data as possible at once (limited by the maximum packet size for one query statement), would be the quickest PHP-based way of doing this. You would typically insert 2k-5k rows at a time.
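
For illustration, a multi-value insert built with placeholders might look like the sketch below. The items(sku, name, price) table is hypothetical, and the literal $rows array stands in for records that would really come from the parsed feed:

<?php
// Multi-value INSERT sketch: one statement carries many rows, so the per-query
// round-trip overhead is paid once per batch instead of once per row.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8mb4', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$rows = [                      // illustrative data; really this comes from the XML feed
    ['A1', 'Widget', 9.99],
    ['A2', 'Gadget', 14.50],
    ['A3', 'Gizmo',  3.25],
];

// One "(?, ?, ?)" placeholder group per row, all in a single statement.
$placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
$stmt = $pdo->prepare("INSERT INTO items (sku, name, price) VALUES $placeholders");

// Flatten the rows into one parameter list that matches the placeholders in order.
$stmt->execute(array_merge(...$rows));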

 

An even more efficient way would be to parse the XML data into a CSV file and use a LOAD DATA LOCAL INFILE query to get the data into the table at the highest speed possible. If this is your own database server, where you can place the file in a location the database server can read directly without going through the client (PHP), a LOAD DATA INFILE query (without the LOCAL keyword) would produce the fastest possible result.

 

From the MySQL documentation:

"When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements."
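
A sketch of that route, again assuming the hypothetical feed.xml layout and items(sku, name, price) table described earlier in the thread. Note that LOCAL requires local_infile to be enabled on both the client and the server, and the FIELDS/LINES clauses must match how the CSV is written:

<?php
// XML -> CSV -> LOAD DATA LOCAL INFILE sketch.
$csvPath = __DIR__ . '/feed.csv';
$csv = fopen($csvPath, 'w');

// Stream the feed and write one CSV line per <item>.
$reader = new XMLReader();
$reader->open('feed.xml');
while ($reader->read() && $reader->name !== 'item');
while ($reader->name === 'item') {
    $item = new SimpleXMLElement($reader->readOuterXML());
    fputcsv($csv, [(string) $item->sku, (string) $item->name, (float) $item->price]);
    $reader->next('item');
}
$reader->close();
fclose($csv);

// LOCAL means the file is read by the client (PHP) and shipped to the server.
// If the server can read the file directly from disk, drop LOCAL (and the PDO
// attribute) for the fastest path.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$pdo->exec("
    LOAD DATA LOCAL INFILE " . $pdo->quote($csvPath) . "
    INTO TABLE items
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    (sku, name, price)
");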

PFMaB, I've tried the multi-value insert query with large XML files and it is ghastly.

 

PHP chokes when it has to deal with an array of more than 2000 items, and imagine each item having about 30 elements of its own.

 

Creating the array that generates the query string took something like 5 minutes, so I guess the CSV option is much more viable.

The most efficient way of using PHP with a multi-value insert query is to get as much data as possible into an array (each entry being a formatted string of data, ready to go between the () in the query), then use array_chunk to replace that array with a chunked version of itself. This uses the least amount of available memory and gets PHP to do the work instead of looping over each piece of data again; the chunk size is the number of rows you want to place into one query. You then loop over the chunks and implode each chunk to produce the data for the query statement.
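
A sketch of that approach, using the same hypothetical items(sku, name, price) table; $parsedRows is an illustrative stand-in for records already pulled out of the XML:

<?php
// array_chunk + implode sketch: each $values entry is one row's escaped data,
// ready to go between the () in the query.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8mb4', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$values = [];
foreach ($parsedRows as $row) {            // $parsedRows: illustrative, already parsed from the feed
    $values[] = $pdo->quote($row['sku']) . ',' .
                $pdo->quote($row['name']) . ',' .
                (float) $row['price'];
}

// Replace the array with a chunked copy of itself; here 10000 rows go into each query.
$values = array_chunk($values, 10000);

foreach ($values as $chunk) {
    $sql = 'INSERT INTO items (sku, name, price) VALUES (' . implode('),(', $chunk) . ')';
    $pdo->exec($sql);
}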

 

If you exceed the maximum packet size in one query, you will get a 'server has gone away' error from your error checking logic.

 

For test data with only a few fields, I have inserted 50k rows in one query. For typical things you would see posted in the forum, I use a block size of 10k rows.

 

I remember reading something about this as well: unless you've designed your filesystem specifically for it, you don't really gain performance beyond a certain number of concurrent inserts.

 

I don't think the number was much over a couple thousand rows at a time, where queries per second kind of flat-lined.

 

It might be worth it to do a similar benchmark on your system. You want to find the sweet spot, where you're inserting as few rows as possible at a time with the highest number of queries per second. (Should be very roughly around 1000 rows per query).

 

As PFMaB said, LOAD DATA INFILE is by far the fastest solution.
