Jump to content

Recommended Posts

I've been trying to do this operation with a Java application, but I'm sure there's got to be a way to do this with mySQL.

 

I have a database (with about 2.2 million records) with thousands of duplicate entries. I only want to check if it's a duplicate with three fields: companyname, city, and state. If more than one record in the database has that same information, then delete the extra records. How would I go about doing this? (hopefully in an efficient way, since it'll probably take a while)

I've got a way to select the unique fields:

 

SELECT DISTINCT companyname,city,state from member;

 

So how do I go about deleting everything else?

 

I found some articles that suggest creating a new table and moving the distinct data there. I just don't know how to combine the insert statement and select statement above.

Sorry for the triple post but my modify button is missing...

 

I'm trying to use this query to insert the distinct rows into the new table, and I get the ensuing error:

 

mysql> insert into new_member select * from member where (select distinct companyname,city,state from member);
ERROR 1241 (21000): Operand should contain 1 column(s)

 

How do I fix this? I feel like the answer is so close. I want to copy all the data where companyname,city and state are unique.

If you use PHP (sorry, I'm only using that) you could store all ID which are not duplicated in an array ($aUniques). Then make an array of all the ID of the table ($aAllRows). There's a PHP array compare function that remove duplicate numbers from two arrays if I'm correct. So the last array would contain only the duplicate rows... or something like that. Yeah, it's a vague explanation of how I would do it but you should get the point... Hope it helps you! ;)

Hey mdnghtblue,

 

I ran into a near identical problem and was able to solve it.  Using this site as a guide, I was able to construct the query that removed duplicate records while leaving one in place.  Here was my solution altered from the query they showed:

 

select bad_rows.*
from contacts as bad_rows
   inner join (
      select email, MIN(contact_id) as min_id
      from contacts
      group by email
      having count(*) > 1
   ) as good_rows on good_rows.email = bad_rows.email
      and good_rows.min_id <> bad_rows.contact_id;

 

Now I only had about 20,000 records to go through and it took about 2 seconds so I don't know if this is too "brute force" for you but at least it's an option.

 

Thank you so much, I'm pretty sure this works. Here's the query I used:

 

delete bad_rows.* 
from member as bad_rows 
       inner join ( 
           select companyname,street,city,state, min(id) as min_id 
           from member 
           group by companyname,street,city,state 
           having count(*) > 1 
       ) as good_rows on good_rows.companyname = bad_rows.companyname 
         and good_rows.min_id <> bad_rows.id;

 

I tried it on really small tables and it worked. I'm trying it on a table with about 400,000 records and it's taken several hours already, guess I'm just gonna run it overnight.

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.