
Loading files into DB without duplicates


paul_hb


Hi,

I have a performance problem loading big CSV files into a table while taking account of duplicate entries.

The table has to be duplicate-free, so the input CSV rows are checked against the existing table rows; so far so good. There are no duplicate rows within a single file, only between files. The field names in each file are identical to those of the main table.

If a duplicate is found (determined by a single unique field per row), the remaining field values have to be merged according to a priority table. This table contains one row per filename and a priority value for each field. For example, field1 may have the value '2' for file1 and '1' for file2, meaning file2 has the higher priority for this field (a lower number wins).

My implementations so far have been pretty slow (checking all priority values for each duplicate row in PHP), so I hope there is a better solution, maybe loading the files into a temp table and joining it with the main table? A rough sketch of that idea follows the table layouts below.

Thanks in advance for any help.

 

main table
ID, field1, field2, field3,...
id1, value1, value2, value3,…
id2, value1, value2, value3,…
...

priority table
file, field1, field2, field3,…
file1.csv, 2, 1, 1,…
file2.csv, 1, 3, 2,…
file3.csv, 3, 2, 3,…
file4.csv, ...

input csv files
ID, field1, field2, field3,…
id1, value1, value2, value3,…
id2, value1, value2, value3,…
...
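
To illustrate the temp-table idea mentioned above, here is a minimal sketch of staging one file with LOAD DATA; the table name, the column types and the CSV options are assumptions, not taken from the real schema:

CREATE TEMPORARY TABLE staging (
  ID     VARCHAR(64)  NOT NULL,
  field1 VARCHAR(255),
  field2 VARCHAR(255),
  field3 VARCHAR(255),
  file   VARCHAR(255),          -- which csv file the row came from
  PRIMARY KEY (ID)
);

-- bulk-load one file and tag every row with its source file name
LOAD DATA LOCAL INFILE 'file1.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, field1, field2, field3)
SET file = 'file1.csv';

With the rows staged like this, the duplicate handling could then be pushed into a join against the main table.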

 


Yeah, but there are many fields in a row, and for every one of them the priority has to be checked. The files are really large and I need a faster way to do it.

Finding the duplicate rows is easy, but merging them (there can be multiple) is not so convenient. So I thought I'd do it directly in MySQL if possible.


I thought it might work like this:

 

SELECT p.file FROM maindb m JOIN priority p ON p.file = m.file WHERE m.ID = '$duplicateID' ORDER BY p.field1 ASC LIMIT 1

 

But 'p.field1' is of course only one of many columns in the priority table, so this way I have to run a query for every field's priority for each duplicate row. There has to be a better way, I hope.  :shrug:
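
One way to avoid a separate query per field is to compare the two files' priority rows in a single statement. A rough sketch, assuming the stored row came from file1.csv and the incoming row from file2.csv (both placeholders), and that a lower number means higher priority as in the first post:

SELECT
  IF(pnew.field1 < pold.field1, 'new', 'old') AS field1_source,
  IF(pnew.field2 < pold.field2, 'new', 'old') AS field2_source,
  IF(pnew.field3 < pold.field3, 'new', 'old') AS field3_source
FROM priority pold
CROSS JOIN priority pnew
WHERE pold.file = 'file1.csv'   -- file the stored row came from
  AND pnew.file = 'file2.csv';  -- file the incoming row came from

That is one round trip per duplicate telling which side wins each field, instead of one query per field.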



I didn't say not to do it in MySQL -- there is INSERT ... ON DUPLICATE KEY UPDATE.
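
For reference, the basic shape of that statement, assuming ID is declared as the primary (or a unique) key of the main table; the values are placeholders:

INSERT INTO maindb (ID, field1, field2, field3)
VALUES ('id1', 'value1', 'value2', 'value3')
ON DUPLICATE KEY UPDATE
  field1 = VALUES(field1),
  field2 = VALUES(field2),
  field3 = VALUES(field3);

Written like this it simply overwrites the stored values on a duplicate key; the priority check still has to go into the UPDATE part.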


But you would still have to check the priority for every field. That's the time-consuming part.

I'm not sure what you mean -- have the priority as one of the columns, and just check whether it's greater or not -- that's not time-consuming at all.
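
A minimal sketch of that suggestion with a single per-row priority column (called prio here, an assumption). It collapses the per-field priorities into one value per row and keeps the convention from the first post that a lower number means higher priority:

INSERT INTO maindb (ID, field1, field2, field3, prio)
VALUES ('id1', 'value1', 'value2', 'value3', 1)
ON DUPLICATE KEY UPDATE
  field1 = IF(VALUES(prio) < prio, VALUES(field1), field1),
  field2 = IF(VALUES(prio) < prio, VALUES(field2), field2),
  field3 = IF(VALUES(prio) < prio, VALUES(field3), field3),
  prio   = LEAST(prio, VALUES(prio));

Each incoming row is handled in one statement; prio is assigned last so that the field comparisons still see the stored priority.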


Yes, the ID is the potential duplicate. So if multiple rows with the same ID value are found, the entire rows have to be merged into one row.

The field values in the new row are determined based on the priorities of the files the old rows originated from. (There is a file column in the main table as well; I forgot that in the first post.)
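
Putting the pieces together, one possible shape for the per-field merge in SQL, assuming a staging table like the one sketched in the first post and the lower-number-wins convention; all names are placeholders:

-- merge staged rows whose ID already exists in the main table, field by field
UPDATE maindb m
JOIN staging  s  ON s.ID    = m.ID
JOIN priority po ON po.file = m.file   -- priorities of the file the stored row came from
JOIN priority pn ON pn.file = s.file   -- priorities of the file the new row came from
SET
  m.field1 = IF(pn.field1 < po.field1, s.field1, m.field1),
  m.field2 = IF(pn.field2 < po.field2, s.field2, m.field2),
  m.field3 = IF(pn.field3 < po.field3, s.field3, m.field3);

-- then insert the staged rows that are not duplicates at all
INSERT INTO maindb (ID, field1, field2, field3, file)
SELECT s.ID, s.field1, s.field2, s.field3, s.file
FROM staging s
LEFT JOIN maindb m ON m.ID = s.ID
WHERE m.ID IS NULL;

One open question is what to write into the main table's file column after a merge, since the merged row can mix values from both files; the sketch above simply leaves the stored file name unchanged.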

