mort Posted November 25, 2010

Hi all

So... I am creating an import script for putting contacts into a database. The script we had worked OK for 500 KB / 20k-row CSV files, but anything much bigger than that started to run into the max execution limit. Rather than raise the limit, I want to create something that will run in the background and work as efficiently as possible.

Basically, the CSV file is uploaded, then you choose whether duplicates should be ignored or overwritten, and you match up the fields in the CSV (using the first line as a field title row) to the fields in the database. The field for the email address is singled out, as this is what gets checked for duplicates that already exist in the system. It then saves these values, along with the filename, and puts it all into an import queue table, which is processed by a CRON job.

Each batch of the CRON job looks in the queue, finds the first import that is incomplete, then starts work on that file from where it left off last. When the batch is complete it updates the row with a pointer into the file for the next batch, and updates how many contacts were imported and how many duplicates there were.

So far so good, but checking for duplicates is massively slowing down the script. I can run 1000 lines of the file in 0.04 seconds without checking, but with checking that increases to 14-15 seconds, and gets longer the more contacts there are in the DB. For every line it tries to import it does a SELECT query on the contact table, and although I am not doing SELECT *, it still adds up to a lot of DB activity. One thought was to load every email address in the contacts table into an array beforehand, but this table could be massive, so that's likely to be just as inefficient.

Any ideas on optimising this process?
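[For context, the batch/resume part of a cron job like the one described above could look roughly like the sketch below. This is only an illustration: the import_queue table and its columns (file_pointer, complete, etc.), the connection details, the upload path and the batch size are all placeholders, not the real schema, and the per-row insert/duplicate handling is elided.]

<?php
// One cron batch: find the first incomplete import and resume it.
$batchSize = 1000;

$link = mysql_connect('localhost', 'user', 'pass');   // placeholder credentials
mysql_select_db('mydb', $link);

$res = mysql_query("SELECT id, filename, file_pointer FROM import_queue
                    WHERE complete = 0 ORDER BY id LIMIT 1", $link);
if (!$res || mysql_num_rows($res) == 0) {
    exit; // nothing queued
}
$job = mysql_fetch_assoc($res);

$fh = fopen('/path/to/uploads/' . $job['filename'], 'r');
fseek($fh, (int)$job['file_pointer']);     // resume where the last batch stopped

$processed = 0;
while ($processed < $batchSize && ($row = fgetcsv($fh)) !== false) {
    // ... map CSV columns to DB fields, insert, count imports/duplicates ...
    $processed++;
}

// Save the new pointer so the next batch picks up from here.
$pointer  = ftell($fh);
$complete = feof($fh) ? 1 : 0;
mysql_query("UPDATE import_queue
             SET file_pointer = $pointer, complete = $complete
             WHERE id = " . (int)$job['id'], $link);
fclose($fh);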
seanlim Posted November 25, 2010

Have you added an index for the email column? Maybe this will help too, since you won't need to do extra select statements and parse the data through PHP: http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

HTH
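[For reference, the ON DUPLICATE KEY UPDATE form from that manual page looks roughly like this. The contacts table and its columns are placeholders, and it relies on the email column being covered by a UNIQUE or PRIMARY key:]

// If 'bob@example.com' already violates the unique key, the existing row
// is updated instead of a new one being inserted.
$sql = "INSERT INTO contacts (email, first_name, last_name)
        VALUES ('bob@example.com', 'Bob', 'Smith')
        ON DUPLICATE KEY UPDATE first_name = VALUES(first_name),
                                last_name  = VALUES(last_name)";
mysql_query($sql);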
mort Posted November 25, 2010

I was actually just looking at this. The only downside I can see is that my email field isn't a primary key; the table is used by multiple sites, so emails can technically be duplicated, but only one per site_id.
PFMaBiSmAd Posted November 25, 2010

You simply make the email and site_id columns a composite key.
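[In practice that would be a one-off statement along these lines, assuming the table is called contacts; the table and index names here are placeholders:]

// Run once against the contacts table (e.g. in phpMyAdmin or via mysql_query):
mysql_query("ALTER TABLE contacts ADD UNIQUE KEY uniq_site_email (site_id, email)");

[With that composite unique key in place, the same email address can still exist under different site_id values, but not twice for the same site, so INSERT IGNORE / ON DUPLICATE KEY UPDATE treat duplicates per site.]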
mort Posted November 25, 2010

> You simply make the email and site_id columns a composite key.

Nice one, that should do the trick. Lemme have a play.
mort Posted November 26, 2010

Only question now is: how can I count how many duplicates were found?
PFMaBiSmAd Posted November 26, 2010

In the error checking logic in your script, when the mysql_query statement fails and returns false (indicating an error occurred), mysql_errno will be a specific value when there was a duplicate key. I think it is 1062, but you should test to confirm what value you get by echoing mysql_errno in a test script.

It's also possible that, if you are using the IGNORE keyword in your query, mysql_affected_rows will correctly indicate whether the query actually inserted a row or not, but I don't know this for a fact and you would need to experiment to see if you get the correct indication out.
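[A rough sketch of both checks, assuming one row per INSERT and placeholder table/column names; $imported and $duplicates are counters initialised elsewhere, and 1062 is MySQL's ER_DUP_ENTRY code:]

// Approach 1: let the duplicate key error happen and inspect mysql_errno().
$result = mysql_query("INSERT INTO contacts (site_id, email) VALUES (1, 'bob@example.com')");
if ($result) {
    $imported++;
} elseif (mysql_errno() == 1062) {   // duplicate key
    $duplicates++;
} else {
    // some other error - log mysql_error() here
}

// Approach 2: INSERT IGNORE never fails on a duplicate;
// mysql_affected_rows() is 0 when the row was skipped.
mysql_query("INSERT IGNORE INTO contacts (site_id, email) VALUES (1, 'bob@example.com')");
if (mysql_affected_rows() == 1) {
    $imported++;
} else {
    $duplicates++;
}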
mort Posted December 2, 2010

Thanks for the tip PFMaBiSmAd, although I don't think this will work in my case, as the query is a batch insert dealing with multiple records in one query.
PFMaBiSmAd Posted December 2, 2010

In another recent thread I used mysql_affected_rows() after a multi-value insert query with the IGNORE keyword, and it does correctly indicate how many new rows were inserted. The number of duplicates would be the total number of rows being put into the query, less the number of affected rows.
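[Applied to the batch insert in this thread, that works out to something like the sketch below, assuming $batch is an array of site_id/email pairs already read from the CSV; the table and column names are placeholders:]

$values = array();
foreach ($batch as $row) {
    $values[] = "(" . (int)$row['site_id'] . ", '" . mysql_real_escape_string($row['email']) . "')";
}
mysql_query("INSERT IGNORE INTO contacts (site_id, email) VALUES " . implode(', ', $values));

$inserted   = mysql_affected_rows();       // rows actually added
$duplicates = count($batch) - $inserted;   // rows skipped as duplicates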