tarsier Posted April 10, 2007

Hello all,

I wrote the following SQL command (with some help) to analyze a table with 25,000 records and delete the rows with duplicate file_name fields. It deletes the older file, by file_date. If two files have the same date, it deletes the first one it finds. When I ran this on my test table, it worked great. I tried it on the big table and it takes a very long time to run, something like 5-10 minutes. Is this normal? Is there a way to speed this up? Is there a better way to do this?

```sql
DELETE tmp_ph_part_files
FROM tmp_ph_part_files
JOIN (
    SELECT a.file_id
    FROM tmp_ph_part_files AS a
    JOIN tmp_ph_part_files AS b
      ON a.file_name = b.file_name
     AND a.file_id <> b.file_id
     AND ((a.file_date < b.file_date)
          OR (a.file_date = b.file_date AND a.file_id > b.file_id))
) AS sub
WHERE tmp_ph_part_files.file_id = sub.file_id
```

Thanks for any help you can give.

Link to comment: https://forums.phpfreaks.com/topic/46432-why-does-this-sql-take-so-long-to-run/
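To make the query's behavior concrete, here is a small reproduction on a five-row toy table. It is adapted for SQLite, which lacks MySQL's multi-table `DELETE ... FROM ... JOIN` syntax, so the same subquery is wrapped in `DELETE ... WHERE file_id IN (...)`; the table and column names follow the post, the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tmp_ph_part_files (
    file_id   INTEGER PRIMARY KEY,
    file_name TEXT,
    file_date TEXT)""")
conn.executemany(
    "INSERT INTO tmp_ph_part_files VALUES (?, ?, ?)",
    [(1, "a.pdf", "2007-04-01"),   # older duplicate of a.pdf -> deleted
     (2, "a.pdf", "2007-04-09"),   # newest a.pdf -> kept
     (3, "b.pdf", "2007-04-05"),   # date tie with row 4, lower id -> kept
     (4, "b.pdf", "2007-04-05"),   # date tie, greater file_id -> deleted
     (5, "c.pdf", "2007-04-02")])  # no duplicate -> kept

# Same dedup logic as the MySQL query, rewritten as a single-table DELETE.
conn.execute("""
    DELETE FROM tmp_ph_part_files WHERE file_id IN (
        SELECT a.file_id
        FROM tmp_ph_part_files AS a
        JOIN tmp_ph_part_files AS b
          ON a.file_name = b.file_name
         AND a.file_id <> b.file_id
         AND ((a.file_date < b.file_date)
              OR (a.file_date = b.file_date AND a.file_id > b.file_id)))""")

survivors = [r[0] for r in conn.execute(
    "SELECT file_id FROM tmp_ph_part_files ORDER BY file_id")]
print(survivors)  # [2, 3, 5]
```

Note that on the date tie between rows 3 and 4, the row with the *greater* file_id is the one deleted.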
Wildbug Posted April 10, 2007

I don't really have a solution that I know will work, but I'll try to give you some ideas.

First, one question about your existing query. I've tried to understand what you wish to accomplish, and I've come up with this: you want to find duplicate files and delete the older versions, right? But when they have the same date, you want to delete the one with the greater file_id number (...a.file_id > b.file_id...)? Is that right? If you use AUTO_INCREMENT, that would seem to mean you are deleting the *newer* version of files created on the same day. A corollary question would be why you didn't use a TIMESTAMP (or DATETIME) for the file instead of a DATE; that would be an easier way to find the newest version of a file. But that's irrelevant to your post, and also irrelevant if your original query was correct.

Do your tables have indices? That should help your performance. Five to ten minutes does seem a little long, but that's also dependent on your hardware. Your query can get pretty big, since it references the table three times, and the self-join alone can compare on the order of 25,000² row pairs. How many duplicate file names do you typically have? It's possible that you could exclude those without duplicates as soon as possible and work with a smaller result set, i.e.:

```sql
SELECT file_name, COUNT(file_id) AS c, MAX(file_date) AS m
FROM tmp_ph_part_files
GROUP BY file_name
HAVING c > 1
```

and use that as your first/innermost subselect.
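One way that pre-filter could be wired into the delete, again demonstrated with SQLite for runnability. Be aware that this sketch differs slightly from the original query: it keeps every row that shares the newest date for a name (it does not break ties by file_id), so an extra condition would be needed to reproduce the tie-breaking exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tmp_ph_part_files (
    file_id INTEGER PRIMARY KEY, file_name TEXT, file_date TEXT)""")
conn.executemany("INSERT INTO tmp_ph_part_files VALUES (?, ?, ?)",
    [(1, "a.pdf", "2007-04-01"),   # older duplicate -> deleted
     (2, "a.pdf", "2007-04-09"),   # newest a.pdf -> kept
     (3, "b.pdf", "2007-04-05"),   # no duplicate -> kept
     (4, "c.pdf", "2007-04-02")])  # no duplicate -> kept

# One aggregation pass finds only the names that actually have duplicates,
# plus the newest date for each; only rows older than that maximum are
# deleted, instead of self-joining every row against every other row.
conn.execute("""
    DELETE FROM tmp_ph_part_files WHERE file_id IN (
        SELECT t.file_id
        FROM tmp_ph_part_files AS t
        JOIN (SELECT file_name, MAX(file_date) AS m
              FROM tmp_ph_part_files
              GROUP BY file_name
              HAVING COUNT(file_id) > 1) AS dup
          ON t.file_name = dup.file_name
         AND t.file_date < dup.m)""")

remaining = [r[0] for r in conn.execute(
    "SELECT file_id FROM tmp_ph_part_files ORDER BY file_id")]
print(remaining)  # [2, 3, 4]
```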
tarsier Posted April 10, 2007

Wildbug,

Thanks for the reply. Here is a little more info about what I'm doing. I'm scanning a huge directory tree, looking for new files to copy into the database. This is a task that will probably run several times a day. First, I scan the directory tree and add all file names, paths, and creation dates to a table in the database. I then remove all duplicate files and older versions of the same file. Lastly, I look at the files I have and copy each one into the database if I don't have it yet.

I have not had to deal with tables this large before, so I have a lot to learn. I did not have any indexes aside from the primary key. I've added some indexes and I'll see how that goes. I will also try to do some of the quicker operations first. Thanks for the suggestions.
Wildbug Posted April 10, 2007

I don't know the details of your project, but here's another thought: are you scanning the same directory tree each time? If you're on *nix, you could use the find command to select only files that have been modified since your last scan.
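The same idea as `find root -newer stampfile`, sketched portably in Python (the function name and the way the last-scan timestamp is passed in are illustrative, not from the thread):

```python
import os

def files_changed_since(root, last_scan_ts):
    """Walk the directory tree and yield only files whose modification
    time is newer than the previous scan's timestamp, so unchanged files
    never enter the staging table at all."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_scan_ts:
                yield path
```

The caller would record the current time after each successful scan and pass it in as `last_scan_ts` on the next run.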
gluck Posted April 10, 2007

Have you thought of using INSERT ... ON DUPLICATE KEY UPDATE? This updates old records and inserts new ones if they aren't there. You don't need to delete older files, since this will update your old records and add new ones when there is no existing match.
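A sketch of the upsert idea. MySQL's syntax is `INSERT ... ON DUPLICATE KEY UPDATE`; the runnable demo below uses SQLite's equivalent, `INSERT ... ON CONFLICT ... DO UPDATE` (SQLite 3.24+). Either form needs a UNIQUE key on the matching column; here file_name is assumed unique purely for the demo, and the table name is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# file_name is the PRIMARY KEY, so a second insert with the same name
# triggers the conflict clause instead of creating a duplicate row.
conn.execute("""CREATE TABLE part_files (
    file_name TEXT PRIMARY KEY,
    file_date TEXT)""")

upsert = """INSERT INTO part_files (file_name, file_date) VALUES (?, ?)
            ON CONFLICT(file_name) DO UPDATE SET
                file_date = excluded.file_date"""

conn.execute(upsert, ("a.pdf", "2007-04-01"))
conn.execute(upsert, ("a.pdf", "2007-04-09"))   # updates in place

rows = list(conn.execute("SELECT file_name, file_date FROM part_files"))
print(rows)  # [('a.pdf', '2007-04-09')]
```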
fenway Posted April 11, 2007

The question really is why you're doing this... all those inequalities in the JOIN conditions are a big problem.
tarsier Posted April 11, 2007

Thanks for all the input. It's running on a Win32 system, so no such luck with a find command.

The purpose of the project is to store and organize PDF files of mechanical prints. The drafting department creates drawings, which can be placed anywhere in a large directory structure. The thought was to have an automatic process scan the structure several times a day and copy any new or updated files into the database. The PHP program parses the part number, revision, and page from the filename. It then removes duplicate files. Finally, it compares the list of scanned files to the ones it already has and copies the new or modified ones into the database. Maybe there is a better way to scan for duplicate files.

I added indexes to the table, of which I had none before, and it is now going MUCH faster. I have not timed it yet, but it's easily two or three times faster. I will have to look into using INSERT ... ON DUPLICATE KEY UPDATE. The only issue I might have is that there are several entries for a given file; filename is not unique. Again, thanks for the input.
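The thread never gives the actual drawing-filename convention, so the parsing step can only be sketched against a hypothetical format such as `12345_revB_p2.pdf`; the pattern, field names, and function name below are all assumptions for illustration.

```python
import re

# Hypothetical naming convention: <part number>_rev<letter>_p<page>.pdf
FILENAME_RE = re.compile(
    r"^(?P<part>\d+)_rev(?P<rev>[A-Z])_p(?P<page>\d+)\.pdf$",
    re.IGNORECASE)

def parse_drawing_name(filename):
    """Split a drawing filename into (part number, revision, page),
    or return None if it doesn't match the expected convention."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return m.group("part"), m.group("rev").upper(), int(m.group("page"))

print(parse_drawing_name("12345_revB_p2.pdf"))  # ('12345', 'B', 2)
print(parse_drawing_name("notes.txt"))          # None
```

Whatever the real convention is, keeping the parse in one well-tested function makes it easy to adjust when the drafting department's naming changes.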
gluck Posted April 11, 2007

You could add another column with a unique key to support the INSERT ... ON DUPLICATE KEY UPDATE approach.
fenway Posted April 11, 2007

Why remove the duplicates? Why not simply take the most recent one?