mdnghtblue Posted May 29, 2008

I've been trying to do this operation with a Java application, but I'm sure there's got to be a way to do this in MySQL itself. I have a database (about 2.2 million records) with thousands of duplicate entries. A record counts as a duplicate based on just three fields: companyname, city, and state. If more than one record has the same values in all three, I want to delete the extra records. How would I go about doing this? (Hopefully in an efficient way, since it'll probably take a while.)
mdnghtblue Posted May 29, 2008 (Author)

I've got a way to select the unique combinations:

SELECT DISTINCT companyname, city, state FROM member;

So how do I go about deleting everything else? I found some articles that suggest creating a new table and moving the distinct data there. I just don't know how to combine an INSERT statement with the SELECT above.
mdnghtblue Posted May 29, 2008 (Author)

Sorry for the triple post, but my modify button is missing... I'm trying to use this query to insert the distinct rows into the new table, and I get the following error:

mysql> insert into new_member select * from member where (select distinct companyname,city,state from member);
ERROR 1241 (21000): Operand should contain 1 column(s)

How do I fix this? I feel like the answer is so close. I want to copy all the data where the combination of companyname, city, and state is unique.
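One way to make the copy-the-distinct-rows idea work is to GROUP BY the three key columns and keep one representative id per group, rather than putting a multi-column subquery in the WHERE clause. A minimal sketch of that approach, using Python's sqlite3 as a stand-in for MySQL and assuming the member table has an integer id primary key (the thread never shows the full schema, so the sample data here is made up):

```python
import sqlite3

# In-memory stand-in for the MySQL database described in the thread.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE member (id INTEGER PRIMARY KEY, companyname TEXT, city TEXT, state TEXT)")
cur.executemany(
    "INSERT INTO member (companyname, city, state) VALUES (?, ?, ?)",
    [("Acme", "Austin", "TX"),
     ("Acme", "Austin", "TX"),   # exact duplicate on all three key fields
     ("Acme", "Dallas", "TX"),
     ("Bolt", "Reno",   "NV")],
)

# Instead of the failing WHERE (SELECT DISTINCT ...), group by the three
# key columns and keep one representative row (the lowest id) per group.
cur.execute("CREATE TABLE new_member (id INTEGER, companyname TEXT, city TEXT, state TEXT)")
cur.execute("""
    INSERT INTO new_member
    SELECT MIN(id), companyname, city, state
    FROM member
    GROUP BY companyname, city, state
""")
print(cur.execute("SELECT COUNT(*) FROM new_member").fetchone()[0])  # prints 3
```

In MySQL the same statement would be INSERT INTO new_member SELECT MIN(id), companyname, city, state FROM member GROUP BY companyname, city, state; afterwards the original table can be dropped and new_member renamed.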
blirette Posted May 30, 2008

If you use PHP (sorry, that's all I use), you could store the IDs of all the non-duplicated rows in one array ($aUniques), then make another array with every ID in the table ($aAllRows). If I remember correctly, PHP has an array-comparison function that returns the values in one array that aren't in the other, so the result would contain only the duplicate rows... or something like that. It's a vague explanation of how I would do it, but you should get the point. Hope it helps!
86Stang Posted May 30, 2008

Hey mdnghtblue, I ran into a nearly identical problem and was able to solve it. Using this site as a guide, I constructed a query that finds the duplicate records while leaving one of each in place. Here's my solution, adapted from the query they showed:

select bad_rows.*
from contacts as bad_rows
inner join (
    select email, min(contact_id) as min_id
    from contacts
    group by email
    having count(*) > 1
) as good_rows
    on good_rows.email = bad_rows.email
    and good_rows.min_id <> bad_rows.contact_id;

Now I only had about 20,000 records to go through and it took about 2 seconds, so I don't know if this is too "brute force" for you, but at least it's an option.
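86Stang's self-join can be exercised end-to-end on a toy table to see exactly which rows it flags. A sketch with Python's sqlite3 (the contacts, email, and contact_id names come from the post above; the sample data is invented):

```python
import sqlite3

# Toy version of 86Stang's contacts table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, email TEXT)")
cur.executemany("INSERT INTO contacts (email) VALUES (?)",
                [("a@x.com",), ("a@x.com",), ("a@x.com",), ("b@x.com",)])

# For each duplicated email, keep the row with the smallest contact_id and
# report every other row in the group as a "bad" (deletable) row.
bad = cur.execute("""
    SELECT bad_rows.contact_id, bad_rows.email
    FROM contacts AS bad_rows
    INNER JOIN (
        SELECT email, MIN(contact_id) AS min_id
        FROM contacts
        GROUP BY email
        HAVING COUNT(*) > 1
    ) AS good_rows
      ON good_rows.email = bad_rows.email
     AND good_rows.min_id <> bad_rows.contact_id
""").fetchall()
print(bad)  # the two extra copies of a@x.com (contact_ids 2 and 3)
```

The HAVING COUNT(*) > 1 clause restricts the inner query to emails that actually repeat, and the min_id <> contact_id condition excludes the one row per group that gets kept.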
mdnghtblue Posted June 2, 2008 (Author)

Thank you so much, I'm pretty sure this works. Here's the query I used (note that the join has to match every column in the GROUP BY; joining on companyname alone would also delete rows that merely share a company name):

delete bad_rows.*
from member as bad_rows
inner join (
    select companyname, street, city, state, min(id) as min_id
    from member
    group by companyname, street, city, state
    having count(*) > 1
) as good_rows
    on good_rows.companyname = bad_rows.companyname
    and good_rows.street = bad_rows.street
    and good_rows.city = bad_rows.city
    and good_rows.state = bad_rows.state
    and good_rows.min_id <> bad_rows.id;

I tried it on really small tables and it worked. I'm trying it on a table with about 400,000 records and it's taken several hours already, so I guess I'm just going to run it overnight.
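The multi-table DELETE above is MySQL-specific syntax, and without an index over the grouped columns the self-join can take hours on a large table. A hedged sketch of a more portable formulation (keep MIN(id) per group, delete everything else), again with Python's sqlite3 standing in for MySQL, with the street column included as in the post above, and with a composite index added for the GROUP BY; the table contents are invented:

```python
import sqlite3

# Portable alternative to the MySQL multi-table DELETE: keep the lowest id
# per (companyname, street, city, state) group and delete the rest.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE member (
    id INTEGER PRIMARY KEY, companyname TEXT, street TEXT, city TEXT, state TEXT)""")
cur.executemany(
    "INSERT INTO member (companyname, street, city, state) VALUES (?, ?, ?, ?)",
    [("Acme", "1 Main St", "Austin", "TX"),
     ("Acme", "1 Main St", "Austin", "TX"),  # exact duplicate
     ("Acme", "2 Oak Ave", "Austin", "TX")],
)

# A composite index over the grouped columns helps the GROUP BY on large
# tables; the same CREATE INDEX works in MySQL before running the DELETE.
cur.execute("CREATE INDEX idx_member_dedup ON member (companyname, street, city, state)")

cur.execute("""
    DELETE FROM member
    WHERE id NOT IN (
        SELECT MIN(id) FROM member
        GROUP BY companyname, street, city, state
    )
""")
print(cur.execute("SELECT COUNT(*) FROM member").fetchone()[0])  # prints 2
```

On MySQL, NOT IN subquery deletes against the same table may need the subquery wrapped in a derived table; either way, adding the index first is usually what turns an overnight run into minutes.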