[SOLVED] constructing a remove duplicate query

mdnghtblue · May 29, 2008

I've been trying to do this operation with a Java application, but I'm sure there's got to be a way to do this with mySQL.

I have a database (with about 2.2 million records) with thousands of duplicate entries. I only want to check if it's a duplicate with three fields: companyname, city, and state. If more than one record in the database has that same information, then delete the extra records. How would I go about doing this? (hopefully in an efficient way, since it'll probably take a while)

mdnghtblue · May 29, 2008

I've got a way to select the unique fields:

SELECT DISTINCT companyname,city,state from member;

So how do I go about deleting everything else?

I found some articles that suggest creating a new table and moving the distinct data there. I just don't know how to combine the insert statement and select statement above.

mdnghtblue · May 29, 2008

Sorry for the triple post but my modify button is missing...

I'm trying to use this query to insert the distinct rows into the new table, and I get the ensuing error:

mysql> insert into new_member select * from member where (select distinct companyname,city,state from member);
ERROR 1241 (21000): Operand should contain 1 column(s)

How do I fix this? I feel like the answer is so close. I want to copy all the data where companyname,city and state are unique.

blirette · May 30, 2008

If you use PHP (sorry, I'm only using that) you could store all ID which are not duplicated in an array ($aUniques). Then make an array of all the ID of the table ($aAllRows). There's a PHP array compare function that remove duplicate numbers from two arrays if I'm correct. So the last array would contain only the duplicate rows... or something like that. Yeah, it's a vague explanation of how I would do it but you should get the point... Hope it helps you!

86Stang · May 30, 2008

Hey mdnghtblue,

I ran into a near identical problem and was able to solve it. Using this site as a guide, I was able to construct the query that removed duplicate records while leaving one in place. Here was my solution altered from the query they showed:

select bad_rows.*
from contacts as bad_rows
   inner join (
      select email, MIN(contact_id) as min_id
      from contacts
      group by email
      having count(*) > 1
   ) as good_rows on good_rows.email = bad_rows.email
      and good_rows.min_id <> bad_rows.contact_id;

Now I only had about 20,000 records to go through and it took about 2 seconds so I don't know if this is too "brute force" for you but at least it's an option.

mdnghtblue · June 2, 2008

Thank you so much, I'm pretty sure this works. Here's the query I used:

delete bad_rows.* 
from member as bad_rows 
       inner join ( 
           select companyname,street,city,state, min(id) as min_id 
           from member 
           group by companyname,street,city,state 
           having count(*) > 1 
       ) as good_rows on good_rows.companyname = bad_rows.companyname 
         and good_rows.min_id <> bad_rows.id;

I tried it on really small tables and it worked. I'm trying it on a table with about 400,000 records and it's taken several hours already, guess I'm just gonna run it overnight.

Sign In

[SOLVED] constructing a remove duplicate query

Recommended Posts

mdnghtblue

Link to comment

Share on other sites

mdnghtblue

Link to comment

Share on other sites

mdnghtblue

Link to comment

Share on other sites

blirette

Link to comment

Share on other sites

86Stang

Link to comment

Share on other sites

mdnghtblue

Link to comment

Share on other sites

Archived

Browse

Activity

Important Information