Why does this SQL take so long to run?


tarsier

Hello all,

I wrote the following SQL command (with some help) to analyze a table with 25,000 records and delete the rows with duplicate file_name fields. It deletes the older file, by file_date. If two files have the same date, it deletes the first one it finds. When I ran this on my test table, it worked great. I tried it on the big table and it takes a very long time to run, something like 5-10 minutes. Is this normal? Is there a way to speed this up? Is there a better way to do this?

 

DELETE tmp_ph_part_files FROM tmp_ph_part_files JOIN (
  SELECT a.file_id FROM tmp_ph_part_files AS a
  JOIN tmp_ph_part_files AS b ON a.file_name = b.file_name AND a.file_id <> b.file_id
    AND ((a.file_date < b.file_date) OR (a.file_date = b.file_date AND a.file_id > b.file_id))
) AS sub WHERE tmp_ph_part_files.file_id = sub.file_id

 

Thanks for any help you can give.


I don't really have a solution that I know will work, but I'll try to give you some ideas.

 

First, one question about your existing query.  I've tried to understand what you wish to accomplish with your query, and I've come up with this:  You want to find duplicate files and delete the older versions, right?  But when they have the same date, you want to delete the one with the greater file_id number?  (...a.file_id > b.file_id...)  Is that right?  That would seem to me to mean, if you use AUTO_INCREMENT, that you would be deleting the newer version of the files created on the same day.

 

I guess a corollary of that question would be why didn't you use a TIMESTAMP (or DATETIME) for the file instead of a DATE?  That would be an easier way to find the newest version of a file.  But that's irrelevant to your post, and also irrelevant if your original query was correct.

 

Do your tables have indexes?  That should help your performance.
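For example, since the query joins on file_name, an index on that column should make the biggest difference. A rough sketch using your table and column names (the index name is arbitrary):

ALTER TABLE tmp_ph_part_files ADD INDEX idx_file_name (file_name);

An index on file_date might help the date comparisons as well, but file_name is the join column, so I'd start there.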

 

Five to ten minutes does seem a little long, but that's also dependent on your hardware.  Your query can get pretty big, too: it references the table three times (the DELETE target plus the a and b aliases), and without indexes the self-join alone can compare up to 25,000 x 25,000 = 625 million row pairs.

 

How many duplicate file names do you typically have?  It's possible that you could exclude those without duplicates as soon as possible and be working with a smaller result set.  I.e.,

 

SELECT file_name,COUNT(file_id) AS c,MAX(file_date) AS m FROM tmp_ph_part_files GROUP BY file_name HAVING c > 1

 

and use that as your first/innermost subselect.
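Untested, but folding that into your DELETE might look something like the sketch below. It removes every row older than the newest file_date for a duplicated file name; ties on file_date would still need your original file_id comparison as a second pass:

DELETE tmp_ph_part_files FROM tmp_ph_part_files JOIN (
  SELECT file_name, COUNT(file_id) AS c, MAX(file_date) AS m
  FROM tmp_ph_part_files GROUP BY file_name HAVING c > 1
) AS dup ON tmp_ph_part_files.file_name = dup.file_name
WHERE tmp_ph_part_files.file_date < dup.m

That way the expensive comparisons only run against file names that actually have duplicates.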


Wildbug,

Thanks for the reply.

 

Here is a little more info about what I'm doing. I'm doing a scan of a huge directory tree and looking for new files to copy into the database. This is a task that will probably run several times a day. First, I scan the directory tree and add all file names, paths, and creation dates to a table in the database. I then remove all duplicate files and older versions of the same file. Lastly, I look at the files that I have and copy each one into the database if I don't have it yet.

 

I have not had to deal with tables this large before, so I have a lot to learn about that. I did not have any indexes, aside from the primary key. I've added some indexes and I'll see how that goes.

 

I will also try to do some of the quicker operations first.

 

Thanks for the suggestions.

 

 


Have you thought of using

INSERT ... ON DUPLICATE KEY UPDATE

This updates old records and inserts new ones if they aren't there.

You don't need to delete older files, as this will update your old records and add new ones when no match is found.
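A rough sketch of the syntax (the literal values and the file_path column name are made up for illustration; also note that ON DUPLICATE KEY UPDATE only fires against a PRIMARY KEY or UNIQUE index, so file_name would need to be declared UNIQUE for this to work):

INSERT INTO tmp_ph_part_files (file_name, file_path, file_date)
VALUES ('bracket_revB.pdf', '/drawings/bracket_revB.pdf', '2007-06-01')
ON DUPLICATE KEY UPDATE
  file_path = VALUES(file_path),
  file_date = VALUES(file_date);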


Thanks for all the input.

 

It's running on a Win32 system, so no such luck with a find command.

 

The purpose of the project is to store and organize PDF files of mechanical prints. The drafting department creates drawings, which can be placed anywhere in a large directory structure. The thought was to have an automatic process scan the structure several times a day and copy any new or updated files into the database. The PHP program parses the part number, revision, and page from the filename. It then removes duplicate files. Finally, it compares the list of scanned files to the ones it already has and copies the new or modified ones into the database. Maybe there is a better way to scan for duplicate files.

 

I added indexes to the table, of which I had none before, and it is now going MUCH faster. I have not timed it yet, but it's easily two or three times faster.

 

I will have to look into using INSERT ... ON DUPLICATE KEY UPDATE. The only issue I might have is that there can be several entries for a given file; the filename is not unique.
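One idea I might try, since I already parse the part number, revision, and page out of the filename: if I store those in their own columns (the column and key names below are made up), a composite UNIQUE key would give ON DUPLICATE KEY UPDATE something unambiguous to match on:

ALTER TABLE tmp_ph_part_files ADD UNIQUE KEY uniq_part_rev_page (part_number, revision, page);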

 

Again, thanks for the input.

 

