
Loading files into DB without duplicates


paul_hb


Hi,

I have a performance problem loading big CSV files into a table while taking account of duplicate entries.

The table has to be duplicate-free, so the input CSV rows are checked against the existing table rows; so far so good. There are no duplicate rows within a single file, only between files. The field names in each file are identical to those of the main table.

If a duplicate is found (determined by a single unique field per row), the remaining field values have to be merged according to a priority table. This table contains one row per filename and a priority value for each field. For example, field1 may have the value '2' for file1 and '1' for file2, meaning file2 has the higher priority for this field (a lower number wins).

My implementations so far have been pretty slow (checking all priority values for each duplicate row in PHP), so I hope there is a better solution, maybe loading the files into a temp table and joining it with the main table? A rough sketch of that idea follows the table layouts below.

Thanks in advance for any help.

 

main table
ID, field1, field2, field3,...
id1, value1, value2, value3,…
id2, value1, value2, value3,…
...

priority table
file, field1, field2, field3,…
file1.csv, 2, 1, 1,…
file2.csv, 1, 3, 2,…
file3.csv, 3, 2, 3,…
file4.csv, ...

input csv files
ID, field1, field2, field3,…
id1, value1, value2, value3,…
id2, value1, value2, value3,…
...
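
To illustrate the temp-table idea mentioned above, here is a minimal sketch of staging one file with LOAD DATA; the table name, the column types and the CSV options are assumptions, not taken from the real schema:

CREATE TEMPORARY TABLE staging (
  ID     VARCHAR(64)  NOT NULL,
  field1 VARCHAR(255),
  field2 VARCHAR(255),
  field3 VARCHAR(255),
  file   VARCHAR(255),          -- which csv file the row came from
  PRIMARY KEY (ID)
);

-- bulk-load one file and tag every row with its source file name
LOAD DATA LOCAL INFILE 'file1.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, field1, field2, field3)
SET file = 'file1.csv';

With the rows staged like this, the duplicate handling could then be pushed into a join against the main table.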

 


Yeah, but there are many fields in a row, and for every one of them the priority has to be checked. The files are really large and I need a faster way to do it.

Finding the duplicate rows is easy, but merging them (there can be multiple) is not so convenient. So I thought I'd do it directly in MySQL if possible.


I thought it might work like this:

 

SELECT p.file FROM maindb m JOIN priority p ON p.file = m.file WHERE m.ID = '$duplicateID' ORDER BY p.field1 ASC LIMIT 1

 

But 'p.field1' is of course only one of many columns in the priority table, so this way I have to run a query for every field's priority for each duplicate row. There has to be a better way, I hope.  :shrug:
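
One way to avoid a separate query per field is to compare the two files' priority rows in a single statement. A rough sketch, assuming the stored row came from file1.csv and the incoming row from file2.csv (both placeholders), and that a lower number means higher priority as in the first post:

SELECT
  IF(pnew.field1 < pold.field1, 'new', 'old') AS field1_source,
  IF(pnew.field2 < pold.field2, 'new', 'old') AS field2_source,
  IF(pnew.field3 < pold.field3, 'new', 'old') AS field3_source
FROM priority pold
CROSS JOIN priority pnew
WHERE pold.file = 'file1.csv'   -- file the stored row came from
  AND pnew.file = 'file2.csv';  -- file the incoming row came from

That is one round trip per duplicate telling which side wins each field, instead of one query per field.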



I didn't say not to do it in MySQL -- there is INSERT ... ON DUPLICATE KEY UPDATE.
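
For reference, the basic shape of that statement, assuming ID is declared as the primary (or a unique) key of the main table; the values are placeholders:

INSERT INTO maindb (ID, field1, field2, field3)
VALUES ('id1', 'value1', 'value2', 'value3')
ON DUPLICATE KEY UPDATE
  field1 = VALUES(field1),
  field2 = VALUES(field2),
  field3 = VALUES(field3);

Written like this it simply overwrites the stored values on a duplicate key; the priority check still has to go into the UPDATE part.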


But you would still have to check the priority for every field. That's the time-consuming part.

I'm not sure what you mean -- have the priority as one of the columns, and just check whether it's greater or not -- that's not time-consuming at all.
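
A minimal sketch of that suggestion with a single per-row priority column (called prio here, an assumption). It collapses the per-field priorities into one value per row and keeps the convention from the first post that a lower number means higher priority:

INSERT INTO maindb (ID, field1, field2, field3, prio)
VALUES ('id1', 'value1', 'value2', 'value3', 1)
ON DUPLICATE KEY UPDATE
  field1 = IF(VALUES(prio) < prio, VALUES(field1), field1),
  field2 = IF(VALUES(prio) < prio, VALUES(field2), field2),
  field3 = IF(VALUES(prio) < prio, VALUES(field3), field3),
  prio   = LEAST(prio, VALUES(prio));

Each incoming row is handled in one statement; prio is assigned last so that the field comparisons still see the stored priority.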


Yes, the ID is the potential duplicate. So if multiple rows with the same ID value are found, the entire rows have to be merged into one row.

The field values in the new row are determined based on the priorities of the files the old rows originated from. (There is a file column in the main table as well; I forgot that in the first post.)
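
Putting the pieces together, one possible shape for the per-field merge in SQL, assuming a staging table like the one sketched in the first post and the lower-number-wins convention; all names are placeholders:

-- merge staged rows whose ID already exists in the main table, field by field
UPDATE maindb m
JOIN staging  s  ON s.ID    = m.ID
JOIN priority po ON po.file = m.file   -- priorities of the file the stored row came from
JOIN priority pn ON pn.file = s.file   -- priorities of the file the new row came from
SET
  m.field1 = IF(pn.field1 < po.field1, s.field1, m.field1),
  m.field2 = IF(pn.field2 < po.field2, s.field2, m.field2),
  m.field3 = IF(pn.field3 < po.field3, s.field3, m.field3);

-- then insert the staged rows that are not duplicates at all
INSERT INTO maindb (ID, field1, field2, field3, file)
SELECT s.ID, s.field1, s.field2, s.field3, s.file
FROM staging s
LEFT JOIN maindb m ON m.ID = s.ID
WHERE m.ID IS NULL;

One open question is what to write into the main table's file column after a merge, since the merged row can mix values from both files; the sketch above simply leaves the stored file name unchanged.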

