paul_hb Posted August 9, 2011

Hi, I have a performance problem loading big CSV files into a table while keeping that table free of duplicate entries. The table has to be duplicate-free, so the incoming CSV rows are checked against the existing table rows; so far so good. There are no duplicate rows within a single file, only between files. The field names in each file are identical to those of the main table.

If a duplicate is found (determined by a single unique field per row), the remaining field values have to be merged according to a priority table. This table contains all filenames and a priority value for each field. So field1 may have value '2' for file1 and value '1' for file2, meaning file2 has higher priority for this field (lower value = higher priority).

My implementations so far were pretty slow (checking all priority values for each duplicate row in PHP), so I hope there is a better solution, maybe loading the files into a temp table and joining it with the main table? Thanks in advance for any help.

main table
ID, field1, field2, field3, ...
id1, value1, value2, value3, ...
id2, value1, value2, value3, ...
...

priority table
file, field1, field2, field3, ...
file1.csv, 2, 1, 1, ...
file2.csv, 1, 3, 2, ...
file3.csv, 3, 2, 3, ...
file4.csv, ...

input csv files
ID, field1, field2, field3, ...
id1, value1, value2, value3, ...
id2, value1, value2, value3, ...
...
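The temp-table idea mentioned above could look roughly like this; a minimal sketch only, assuming placeholder names (staging, main, field1, ...) rather than the actual schema:

```sql
-- Create a staging table with the same columns as the main table,
-- plus a column recording which file each row came from.
CREATE TABLE staging LIKE main;
ALTER TABLE staging ADD COLUMN file VARCHAR(255);

-- Bulk-load one CSV; far faster than inserting row by row from PHP.
LOAD DATA LOCAL INFILE 'file2.csv'
INTO TABLE staging
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, field1, field2, field3)
SET file = 'file2.csv';
```

The merge against the main table would then happen in a second step, which is what the rest of the thread is about.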
fenway Posted August 9, 2011

Duplicates how?
paul_hb (Author) Posted August 9, 2011

The ID column is the relevant value. If the ID is already in the table, the row is considered a duplicate and has to be merged.
fenway Posted August 9, 2011

Why not just have a unique key on ID, and decide based on priority whether to UPDATE or not?
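A minimal sketch of that suggestion, using the placeholder names from above:

```sql
-- Hypothetical: declare ID unique so MySQL itself detects duplicates.
-- A plain INSERT of an existing ID then fails, or can be turned into
-- a conditional update (see the later posts).
ALTER TABLE main ADD UNIQUE KEY uq_main_id (ID);
```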
paul_hb (Author) Posted August 9, 2011

Yeah, but there are many fields in a row, and the priority has to be checked for every one of them. The files are really large and I need a faster way to do it. Finding the duplicate rows is easy, but merging them (there can be several) is not so convenient, so I thought of doing it directly in MySQL if possible.
paul_hb (Author) Posted August 10, 2011

I thought it might work something like this:

SELECT p.file
FROM priority p
RIGHT JOIN maindb m ON m.file = p.file
WHERE m.ID = '$duplicateID'
HAVING MIN(p.field1)

But 'p.field1' is of course only one of many columns in the table, so this way I would have to query the priority of every field for each duplicate row. There has to be a better way, I hope.
fenway Posted August 10, 2011

> Yeah, but there are many fields in a row, and the priority has to be checked for every one of them. The files are really large and I need a faster way to do it. Finding the duplicate rows is easy, but merging them (there can be several) is not so convenient, so I thought of doing it directly in MySQL if possible.

I didn't say not to do it in MySQL -- there is INSERT ... ON DUPLICATE KEY UPDATE.
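For reference, the basic shape of that statement; a sketch with placeholder names, showing only a couple of the ~200 columns and simply letting the new row win:

```sql
-- Requires the unique key on ID. VALUES(col) is the value the INSERT
-- attempted for that column.
INSERT INTO main (ID, file, field1, field2)
VALUES ('id1', 'file2.csv', 'value1', 'value2')
ON DUPLICATE KEY UPDATE
    file   = VALUES(file),
    field1 = VALUES(field1),
    field2 = VALUES(field2);
```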
paul_hb (Author) Posted August 10, 2011

But you would still have to check the priority for every field. That's the time-consuming part.
fenway Posted August 10, 2011

> But you would still have to check the priority for every field. That's the time-consuming part.

I'm not sure what you mean -- have the priority as one of the columns, and just check whether it's greater or not. That's not time-consuming at all.
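This seems to assume a single priority per row rather than per field; a sketch of that reading (placeholder names, lower number = higher priority as in the first post):

```sql
-- prio is a hypothetical column storing the priority of the file the
-- current row came from. Assignments run left to right, so prio is
-- updated last to keep the field comparisons based on the old value.
INSERT INTO main (ID, prio, field1, field2)
VALUES ('id1', 1, 'value1', 'value2')
ON DUPLICATE KEY UPDATE
    field1 = IF(VALUES(prio) < prio, VALUES(field1), field1),
    field2 = IF(VALUES(prio) < prio, VALUES(field2), field2),
    prio   = IF(VALUES(prio) < prio, VALUES(prio), prio);
```

As the following posts make clear, this doesn't quite match the problem, because the priority differs per field, not just per file.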
paul_hb (Author) Posted August 10, 2011

There are 200 columns, so even with only 500 duplicate rows (there will be many more), 100,000 checks are required to merge them all. And that is of course very slow.
fenway Posted August 10, 2011

But you said the duplicates are based on ID alone -- that's a single column, not 200.
paul_hb (Author) Posted August 11, 2011

Yes, the ID is what can be duplicated. If multiple rows with the same ID value are found, they have to be merged into one row, and the field values of the merged row are determined by the priorities of the files the old rows originated from. (There is also a file column in the main table; I forgot that in the first post.)
fenway Posted August 11, 2011

Merged? I thought that if a higher-priority row was found, you update everything.
paul_hb (Author) Posted August 12, 2011

No, the priority table has a value for every field of every file. That's the difficulty.
fenway Posted August 14, 2011

Then you should pre-process and merge the priorities first, then compare to the table.
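One way to read this suggestion, building on the staging-table sketch above: keep a per-field priority column next to each data column in the main table (a hypothetical field1_prio, field2_prio, ... layout, not part of the original schema), look up the incoming file's priorities once, and fold everything into a single statement so each field only changes when the incoming file outranks the file its current value came from:

```sql
-- Look up the incoming file's per-field priorities once (placeholder names).
SET @p1 := (SELECT field1 FROM priority WHERE file = 'file2.csv');
SET @p2 := (SELECT field2 FROM priority WHERE file = 'file2.csv');

-- Merge the whole staging table in one pass. Lower number = higher priority.
-- Each value column is assigned before its _prio column so the IF() still
-- sees the old priority.
INSERT INTO main (ID, field1, field1_prio, field2, field2_prio)
SELECT s.ID, s.field1, @p1, s.field2, @p2
FROM staging s
ON DUPLICATE KEY UPDATE
    field1      = IF(@p1 < field1_prio, VALUES(field1), field1),
    field1_prio = IF(@p1 < field1_prio, @p1, field1_prio),
    field2      = IF(@p2 < field2_prio, VALUES(field2), field2),
    field2_prio = IF(@p2 < field2_prio, @p2, field2_prio);
```

With 200 columns the statement would have to be generated programmatically in PHP, but it is still one round trip per file instead of one priority check per field per duplicate row.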