mort Posted November 25, 2010

Hi all

So... I am creating an import script for putting contacts into a database. The script we had worked OK for 500 KB / 20k-row CSV files, but anything much bigger than that started to run into the max execution limit. Rather than raise the limit, I want to create something that will run in the background and work as efficiently as possible.

Basically, the CSV file is uploaded, then you choose whether duplicates should be ignored or overwritten, and you match up the fields in the CSV (using the first line as a field title row) to the fields in the database. The field for the email address is singled out, as this is what gets checked for duplicates that already exist in the system. It then saves these values, along with the filename, and puts it all into an import queue table, which is processed by a CRON job.

Each batch of the CRON job looks in the queue, finds the first import that is incomplete, then starts work on that file from where it left off last. When the batch is complete it updates the row with a pointer into the file for the next batch, and updates how many contacts were imported and how many duplicates there were.

So far so good, but checking for duplicates is massively slowing down the script. I can run 1000 lines of the file in 0.04 seconds without checking, but with checking that increases to 14-15 seconds, and gets longer the more contacts there are in the DB. For every line it tries to import it does a SELECT query on the contact table, and although I am not doing SELECT *, it still adds up to a lot of DB activity. One thought was to load every email address in the contacts table into an array beforehand, but this table could be massive, so that's likely to be just as inefficient.

Any ideas on optimising this process?
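[For context, the batch/resume part of a cron job like the one described above could look roughly like the sketch below. This is only an illustration: the import_queue table and its columns (file_pointer, complete, etc.), the connection details, the upload path and the batch size are all placeholders, not the real schema, and the per-row insert/duplicate handling is elided.]

<?php
// One cron batch: find the first incomplete import and resume it.
$batchSize = 1000;

$link = mysql_connect('localhost', 'user', 'pass');   // placeholder credentials
mysql_select_db('mydb', $link);

$res = mysql_query("SELECT id, filename, file_pointer FROM import_queue
                    WHERE complete = 0 ORDER BY id LIMIT 1", $link);
if (!$res || mysql_num_rows($res) == 0) {
    exit; // nothing queued
}
$job = mysql_fetch_assoc($res);

$fh = fopen('/path/to/uploads/' . $job['filename'], 'r');
fseek($fh, (int)$job['file_pointer']);     // resume where the last batch stopped

$processed = 0;
while ($processed < $batchSize && ($row = fgetcsv($fh)) !== false) {
    // ... map CSV columns to DB fields, insert, count imports/duplicates ...
    $processed++;
}

// Save the new pointer so the next batch picks up from here.
$pointer  = ftell($fh);
$complete = feof($fh) ? 1 : 0;
mysql_query("UPDATE import_queue
             SET file_pointer = $pointer, complete = $complete
             WHERE id = " . (int)$job['id'], $link);
fclose($fh);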
seanlim Posted November 25, 2010

Have you added an index for the email column? Maybe this will help too, since you won't need to do extra select statements and parse the data through PHP: http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

HTH
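[For reference, the ON DUPLICATE KEY UPDATE form from that manual page looks roughly like this. The contacts table and its columns are placeholders, and it relies on the email column being covered by a UNIQUE or PRIMARY key:]

// If 'bob@example.com' already violates the unique key, the existing row
// is updated instead of a new one being inserted.
$sql = "INSERT INTO contacts (email, first_name, last_name)
        VALUES ('bob@example.com', 'Bob', 'Smith')
        ON DUPLICATE KEY UPDATE first_name = VALUES(first_name),
                                last_name  = VALUES(last_name)";
mysql_query($sql);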
mort Posted November 25, 2010

I was actually just looking at this. The only downside I can see is that my email field isn't a primary key; the table is used by multiple sites, so emails can technically be duplicated, but only one per site_id.
PFMaBiSmAd Posted November 25, 2010

You simply make the email and site_id columns a composite key.
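[In practice that would be a one-off statement along these lines, assuming the table is called contacts; the table and index names here are placeholders:]

// Run once against the contacts table (e.g. in phpMyAdmin or via mysql_query):
mysql_query("ALTER TABLE contacts ADD UNIQUE KEY uniq_site_email (site_id, email)");

[With that composite unique key in place, the same email address can still exist under different site_id values, but not twice for the same site, so INSERT IGNORE / ON DUPLICATE KEY UPDATE treat duplicates per site.]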
mort Posted November 25, 2010

> You simply make the email and site_id columns a composite key.

Nice one, that should do the trick. Lemme have a play.
mort Posted November 26, 2010

Only question now is: how can I count how many duplicates were found?
PFMaBiSmAd Posted November 26, 2010

In the error checking logic in your script, when the mysql_query statement fails and returns false (indicating an error occurred), mysql_errno will be a specific value when there was a duplicate key. I think it is 1062, but you should test to confirm what value you get by echoing mysql_errno in a test script.

It's also possible that, if you are using the IGNORE keyword in your query, mysql_affected_rows will correctly indicate whether the query actually inserted a row or not, but I don't know this for a fact and you would need to experiment to see if you get the correct indication out.
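[A rough sketch of both checks, assuming one row per INSERT and placeholder table/column names; $imported and $duplicates are counters initialised elsewhere, and 1062 is MySQL's ER_DUP_ENTRY code:]

// Approach 1: let the duplicate key error happen and inspect mysql_errno().
$result = mysql_query("INSERT INTO contacts (site_id, email) VALUES (1, 'bob@example.com')");
if ($result) {
    $imported++;
} elseif (mysql_errno() == 1062) {   // duplicate key
    $duplicates++;
} else {
    // some other error - log mysql_error() here
}

// Approach 2: INSERT IGNORE never fails on a duplicate;
// mysql_affected_rows() is 0 when the row was skipped.
mysql_query("INSERT IGNORE INTO contacts (site_id, email) VALUES (1, 'bob@example.com')");
if (mysql_affected_rows() == 1) {
    $imported++;
} else {
    $duplicates++;
}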
mort Posted December 2, 2010

Thanks for the tip PFMaBiSmAd, although I don't think this will work in my case, as the query is a batch insert dealing with multiple records in one query.
PFMaBiSmAd Posted December 2, 2010

In another recent thread I used mysql_affected_rows() after a multi-value insert query with the IGNORE keyword, and it does correctly indicate how many new rows were inserted. The number of duplicates would be the total number of rows being put into the query, less the number of affected rows.
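[Applied to the batch insert in this thread, that works out to something like the sketch below, assuming $batch is an array of site_id/email pairs already read from the CSV; the table and column names are placeholders:]

$values = array();
foreach ($batch as $row) {
    $values[] = "(" . (int)$row['site_id'] . ", '" . mysql_real_escape_string($row['email']) . "')";
}
mysql_query("INSERT IGNORE INTO contacts (site_id, email) VALUES " . implode(', ', $values));

$inserted   = mysql_affected_rows();       // rows actually added
$duplicates = count($batch) - $inserted;   // rows skipped as duplicates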