tarsier Posted April 10, 2007

Hello all,

I wrote the following SQL command (with some help) to analyze a table with 25,000 records and delete the rows with duplicate file_name fields. It deletes the older file, by file_date. If two files have the same date, it deletes the first one it finds. When I ran this on my test table, it worked great. I tried it on the big table and it takes a very long time to run, something like 5-10 minutes. Is this normal? Is there a way to speed this up? Is there a better way to do this?

```sql
DELETE tmp_ph_part_files
FROM tmp_ph_part_files
JOIN (
    SELECT a.file_id
    FROM tmp_ph_part_files AS a
    JOIN tmp_ph_part_files AS b
      ON a.file_name = b.file_name
     AND a.file_id <> b.file_id
     AND ((a.file_date < b.file_date)
          OR (a.file_date = b.file_date AND a.file_id > b.file_id))
) AS sub
WHERE tmp_ph_part_files.file_id = sub.file_id
```

Thanks for any help you can give.

Link to comment: https://forums.phpfreaks.com/topic/46432-why-does-this-sql-take-so-long-to-run/
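To make the query's behavior concrete, here is a small reproduction on a five-row toy table. It is adapted for SQLite, which lacks MySQL's multi-table `DELETE ... FROM ... JOIN` syntax, so the same subquery is wrapped in `DELETE ... WHERE file_id IN (...)`; the table and column names follow the post, the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tmp_ph_part_files (
    file_id   INTEGER PRIMARY KEY,
    file_name TEXT,
    file_date TEXT)""")
conn.executemany(
    "INSERT INTO tmp_ph_part_files VALUES (?, ?, ?)",
    [(1, "a.pdf", "2007-04-01"),   # older duplicate of a.pdf -> deleted
     (2, "a.pdf", "2007-04-09"),   # newest a.pdf -> kept
     (3, "b.pdf", "2007-04-05"),   # date tie with row 4, lower id -> kept
     (4, "b.pdf", "2007-04-05"),   # date tie, greater file_id -> deleted
     (5, "c.pdf", "2007-04-02")])  # no duplicate -> kept

# Same dedup logic as the MySQL query, rewritten as a single-table DELETE.
conn.execute("""
    DELETE FROM tmp_ph_part_files WHERE file_id IN (
        SELECT a.file_id
        FROM tmp_ph_part_files AS a
        JOIN tmp_ph_part_files AS b
          ON a.file_name = b.file_name
         AND a.file_id <> b.file_id
         AND ((a.file_date < b.file_date)
              OR (a.file_date = b.file_date AND a.file_id > b.file_id)))""")

survivors = [r[0] for r in conn.execute(
    "SELECT file_id FROM tmp_ph_part_files ORDER BY file_id")]
print(survivors)  # [2, 3, 5]
```

Note that on the date tie between rows 3 and 4, the row with the *greater* file_id is the one deleted.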
Wildbug Posted April 10, 2007

I don't really have a solution that I know will work, but I'll try to give you some ideas.

First, one question about your existing query. I've tried to understand what you wish to accomplish, and I've come up with this: you want to find duplicate files and delete the older versions, right? But when they have the same date, you want to delete the one with the greater file_id number (...a.file_id > b.file_id...)? Is that right? If you use AUTO_INCREMENT, that would seem to mean you are deleting the *newer* version of files created on the same day. A corollary question would be why you didn't use a TIMESTAMP (or DATETIME) for the file instead of a DATE; that would be an easier way to find the newest version of a file. But that's irrelevant to your post, and also irrelevant if your original query was correct.

Do your tables have indices? That should help your performance. Five to ten minutes does seem a little long, but that's also dependent on your hardware. Your query can get pretty big, since it references the table three times, and the self-join alone can compare on the order of 25,000² row pairs. How many duplicate file names do you typically have? It's possible that you could exclude those without duplicates as soon as possible and work with a smaller result set, i.e.:

```sql
SELECT file_name, COUNT(file_id) AS c, MAX(file_date) AS m
FROM tmp_ph_part_files
GROUP BY file_name
HAVING c > 1
```

and use that as your first/innermost subselect.
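One way that pre-filter could be wired into the delete, again demonstrated with SQLite for runnability. Be aware that this sketch differs slightly from the original query: it keeps every row that shares the newest date for a name (it does not break ties by file_id), so an extra condition would be needed to reproduce the tie-breaking exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tmp_ph_part_files (
    file_id INTEGER PRIMARY KEY, file_name TEXT, file_date TEXT)""")
conn.executemany("INSERT INTO tmp_ph_part_files VALUES (?, ?, ?)",
    [(1, "a.pdf", "2007-04-01"),   # older duplicate -> deleted
     (2, "a.pdf", "2007-04-09"),   # newest a.pdf -> kept
     (3, "b.pdf", "2007-04-05"),   # no duplicate -> kept
     (4, "c.pdf", "2007-04-02")])  # no duplicate -> kept

# One aggregation pass finds only the names that actually have duplicates,
# plus the newest date for each; only rows older than that maximum are
# deleted, instead of self-joining every row against every other row.
conn.execute("""
    DELETE FROM tmp_ph_part_files WHERE file_id IN (
        SELECT t.file_id
        FROM tmp_ph_part_files AS t
        JOIN (SELECT file_name, MAX(file_date) AS m
              FROM tmp_ph_part_files
              GROUP BY file_name
              HAVING COUNT(file_id) > 1) AS dup
          ON t.file_name = dup.file_name
         AND t.file_date < dup.m)""")

remaining = [r[0] for r in conn.execute(
    "SELECT file_id FROM tmp_ph_part_files ORDER BY file_id")]
print(remaining)  # [2, 3, 4]
```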
tarsier Posted April 10, 2007

Wildbug,

Thanks for the reply. Here is a little more info about what I'm doing. I'm scanning a huge directory tree, looking for new files to copy into the database. This is a task that will probably run several times a day. First, I scan the directory tree and add all file names, paths, and creation dates to a table in the database. I then remove all duplicate files and older versions of the same file. Lastly, I look at the files I have and copy each one into the database if I don't have it yet.

I have not had to deal with tables this large before, so I have a lot to learn. I did not have any indexes aside from the primary key. I've added some indexes and I'll see how that goes. I will also try to do some of the quicker operations first. Thanks for the suggestions.
Wildbug Posted April 10, 2007

I don't know the details of your project, but here's another thought: are you scanning the same directory tree each time? If you're on *nix, you could use the find command to select only files that have been modified since your last scan.
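The same idea as `find root -newer stampfile`, sketched portably in Python (the function name and the way the last-scan timestamp is passed in are illustrative, not from the thread):

```python
import os

def files_changed_since(root, last_scan_ts):
    """Walk the directory tree and yield only files whose modification
    time is newer than the previous scan's timestamp, so unchanged files
    never enter the staging table at all."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_scan_ts:
                yield path
```

The caller would record the current time after each successful scan and pass it in as `last_scan_ts` on the next run.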
gluck Posted April 10, 2007

Have you thought of using INSERT ... ON DUPLICATE KEY UPDATE? This updates old records and inserts new ones if they aren't there. You don't need to delete older files, since this will update your old records and add new ones when there is no existing match.
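A sketch of the upsert idea. MySQL's syntax is `INSERT ... ON DUPLICATE KEY UPDATE`; the runnable demo below uses SQLite's equivalent, `INSERT ... ON CONFLICT ... DO UPDATE` (SQLite 3.24+). Either form needs a UNIQUE key on the matching column; here file_name is assumed unique purely for the demo, and the table name is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# file_name is the PRIMARY KEY, so a second insert with the same name
# triggers the conflict clause instead of creating a duplicate row.
conn.execute("""CREATE TABLE part_files (
    file_name TEXT PRIMARY KEY,
    file_date TEXT)""")

upsert = """INSERT INTO part_files (file_name, file_date) VALUES (?, ?)
            ON CONFLICT(file_name) DO UPDATE SET
                file_date = excluded.file_date"""

conn.execute(upsert, ("a.pdf", "2007-04-01"))
conn.execute(upsert, ("a.pdf", "2007-04-09"))   # updates in place

rows = list(conn.execute("SELECT file_name, file_date FROM part_files"))
print(rows)  # [('a.pdf', '2007-04-09')]
```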
fenway Posted April 11, 2007

The question really is why you're doing this... all those inequalities in the JOIN conditions are a big problem.
tarsier Posted April 11, 2007

Thanks for all the input. It's running on a Win32 system, so no such luck with a find command.

The purpose of the project is to store and organize PDF files of mechanical prints. The drafting department creates drawings, which can be placed anywhere in a large directory structure. The thought was to have an automatic process scan the structure several times a day and copy any new or updated files into the database. The PHP program parses the part number, revision, and page from the filename. It then removes duplicate files. Finally, it compares the list of scanned files to the ones it already has and copies the new or modified ones into the database. Maybe there is a better way to scan for duplicate files.

I added indexes to the table, of which I had none before, and it is now going MUCH faster. I have not timed it yet, but it's easily two or three times faster. I will have to look into using INSERT ... ON DUPLICATE KEY UPDATE. The only issue I might have is that there are several entries for a given file; filename is not unique. Again, thanks for the input.
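The thread never gives the actual drawing-filename convention, so the parsing step can only be sketched against a hypothetical format such as `12345_revB_p2.pdf`; the pattern, field names, and function name below are all assumptions for illustration.

```python
import re

# Hypothetical naming convention: <part number>_rev<letter>_p<page>.pdf
FILENAME_RE = re.compile(
    r"^(?P<part>\d+)_rev(?P<rev>[A-Z])_p(?P<page>\d+)\.pdf$",
    re.IGNORECASE)

def parse_drawing_name(filename):
    """Split a drawing filename into (part number, revision, page),
    or return None if it doesn't match the expected convention."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return m.group("part"), m.group("rev").upper(), int(m.group("page"))

print(parse_drawing_name("12345_revB_p2.pdf"))  # ('12345', 'B', 2)
print(parse_drawing_name("notes.txt"))          # None
```

Whatever the real convention is, keeping the parse in one well-tested function makes it easy to adjust when the drafting department's naming changes.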
gluck Posted April 11, 2007

You could add another column with a unique key to support the INSERT ... ON DUPLICATE KEY UPDATE approach.
fenway Posted April 11, 2007

Why remove the duplicates? Why not simply take the most recent one?