Marl

finding duplicate images

I have a website with about 15,000 images, and get about 20-30 new ones every day. However, duplicates are becoming a problem, and right now I'm looking for a method to alert a user who is uploading that his/her image may already be in the database.

First, I have a SHA checksum on all the images, so it's not a problem to find an exact duplicate, but the problem arises when the image has been scaled, saved an extra time (JPEG), or something like that.
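
The exact-match check can be sketched as follows. This is only an illustration in Python: the post does not say which SHA variant or language is used, so SHA-256 and the names `known` and `find_exact_duplicate` are assumptions. It also shows why the checksum misses re-saved files: a single changed byte gives an unrelated digest.

```python
import hashlib
from typing import Optional

def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def find_exact_duplicate(data: bytes, known: dict) -> Optional[str]:
    """Look up an upload's digest in a digest -> image-id mapping."""
    return known.get(sha256_of_bytes(data))

# Hypothetical stand-ins for stored image files.
known = {
    sha256_of_bytes(b"image-one"): "img1",
    sha256_of_bytes(b"image-two"): "img2",
}

exact = find_exact_duplicate(b"image-one", known)     # byte-identical upload
resaved = find_exact_duplicate(b"image-one!", known)  # one extra byte: no match
```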

After searching for a solution I found out that ImageMagick has a "compare" function, but running compare on 15,000+ images every time a user uploads a new image is not an option :/

Another method I was thinking about was taking 20-30 "test pixels" from each image, saving their colors, and trying to match them against every new image. However, this would only work if the images have the same size.
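
One possible way around the same-size limitation, which is not something proposed in the thread, is to pick the test pixels at relative positions (fractions of width and height) and allow a small tolerance when comparing. A rough Python sketch, assuming images are already decoded into rows of grayscale values (0-255); the function names, the 5x5 grid, and the tolerance of 16 are all illustrative choices:

```python
def sample_points(pixels, n=5):
    """Read an n x n grid of test pixels at relative positions,
    so the same spots are sampled whatever the image dimensions."""
    height, width = len(pixels), len(pixels[0])
    samples = []
    for i in range(n):
        for j in range(n):
            y = int((i + 0.5) * height / n)
            x = int((j + 0.5) * width / n)
            samples.append(pixels[y][x])
    return samples

def looks_similar(a, b, tolerance=16):
    """True if every pair of corresponding test pixels is within tolerance."""
    return all(abs(p - q) <= tolerance
               for p, q in zip(sample_points(a), sample_points(b)))

def gradient(width, height):
    """Synthetic test image: a left-to-right brightness ramp."""
    return [[x * 255 // (width - 1) for x in range(width)]
            for _ in range(height)]

small = gradient(20, 20)
scaled = gradient(40, 40)                           # same picture, twice the size
inverted = [[255 - v for v in row] for row in small]
```

Here `looks_similar(small, scaled)` holds while `looks_similar(small, inverted)` does not. Real photos would need a tolerance tuned by experiment, and a crop would still defeat this check.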

The last solution I have been thinking about is calculating some sort of "average color" of a picture, but I fear that it wouldn't be very reliable and would either return far too many matches, or only exact copies.
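
The reliability worry can be made concrete with a small Python sketch. It assumes images are given as rows of (r, g, b) tuples, and the helper names are made up for illustration. Two clearly different pictures end up with exactly the same average color, so a single average on its own would flag them as duplicates:

```python
def average_color(pixels):
    """Mean (r, g, b) over all pixels of an image given as rows of RGB tuples."""
    count = sum(len(row) for row in pixels)
    totals = [0, 0, 0]
    for row in pixels:
        for r, g, b in row:
            totals[0] += r
            totals[1] += g
            totals[2] += b
    return tuple(t / count for t in totals)

def color_distance(c1, c2):
    """Euclidean distance between two RGB colors."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

BLACK, WHITE = (0, 0, 0), (255, 255, 255)

# Left half black, right half white.
left_right = [[BLACK, BLACK, WHITE, WHITE] for _ in range(4)]
# Top half black, bottom half white.
top_bottom = [[BLACK] * 4, [BLACK] * 4, [WHITE] * 4, [WHITE] * 4]

distance = color_distance(average_color(left_right), average_color(top_bottom))
```

Both images average to mid-gray, so `distance` is zero even though the pictures differ. Averaging per region instead of over the whole image would keep some layout information, at the cost of a larger signature.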

How would you solve this problem?
Isn't there some kind of standard solution?

Well, for starters you could of course limit your requirements.

So instead of wanting to make sure you get ALL duplicates, just find some, or a lot. And for this you could use the methods you described yourself.

And maybe you should think about file size as well.

Sunday morning, daylight saving time, lost another hour of sleep, just took my meds, not happy with the bald spot that keeps growing on the back of my head (but at least it's not being replaced by ones on my back or in my ears), anyway how about...

a separate table |imageID|cksum value of image|cksum value of thumb|
index on cksum values

when a new image is uploaded, search the table for matches?
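
A rough sketch of that table and lookup, using Python's built-in sqlite3 module purely for illustration. The table name, column names, and helper functions are assumptions, and the checksum strings are placeholders. Note that a thumbnail checksum only matches when the thumbnails come out byte-identical, which the suggestion above does not address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute("""
    CREATE TABLE image_checksums (
        image_id    INTEGER PRIMARY KEY,
        image_cksum TEXT NOT NULL,
        thumb_cksum TEXT NOT NULL
    )
""")
# Index both checksum columns so lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_image_cksum ON image_checksums (image_cksum)")
conn.execute("CREATE INDEX idx_thumb_cksum ON image_checksums (thumb_cksum)")

def register(conn, image_id, image_cksum, thumb_cksum):
    """Store the checksums of a newly accepted image."""
    conn.execute(
        "INSERT INTO image_checksums VALUES (?, ?, ?)",
        (image_id, image_cksum, thumb_cksum),
    )

def find_matches(conn, image_cksum, thumb_cksum):
    """Return ids of stored images sharing either checksum with the upload."""
    rows = conn.execute(
        "SELECT image_id FROM image_checksums "
        "WHERE image_cksum = ? OR thumb_cksum = ? ORDER BY image_id",
        (image_cksum, thumb_cksum),
    ).fetchall()
    return [row[0] for row in rows]

register(conn, 1, "aaa", "t1")  # placeholder checksum values
register(conn, 2, "bbb", "t2")
```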

Lite...

I've been looking at ways of doing a similar thing, and there were two cases where an image match would cause issues. Here are the solutions I had for each:

1, The image has been cropped compared to the existing image, or it is the larger, uncropped version of the existing image. Even though it's a lengthy process, I thought of taking a few 'sample' lines from the smaller of the two images using 'imagecolorat' (GD library), then going through each line of the larger image (also with imagecolorat) looking for a match.

2, The image has been resized. Scale the smaller picture, either with a plain resize or resampled (using the GD library again), so it's the same size as the larger picture, then run a check similar to the one in point 1 above.
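
The two checks above can be sketched roughly like this. The post describes them in terms of PHP's GD functions; the version below is a Python illustration on plain grids of grayscale values, and `find_crop`, `resize_nearest`, the choice of three sample lines, and the tolerance are all assumptions rather than the poster's actual code.

```python
def rows_match(a, b, tolerance):
    """True if two equal-length pixel rows agree within tolerance."""
    return all(abs(p - q) <= tolerance for p, q in zip(a, b))

def find_crop(small, big, tolerance=0):
    """Point 1: look for a few sample lines of `small` inside `big`.
    Returns the (left, top) offset of the first match, or None."""
    small_h, small_w = len(small), len(small[0])
    big_h, big_w = len(big), len(big[0])
    sample_ys = sorted({0, small_h // 2, small_h - 1})  # first, middle, last line
    for top in range(big_h - small_h + 1):
        for left in range(big_w - small_w + 1):
            if all(rows_match(small[dy],
                              big[top + dy][left:left + small_w],
                              tolerance)
                   for dy in sample_ys):
                return left, top
    return None

def resize_nearest(pixels, new_w, new_h):
    """Point 2: plain (non-resampled) nearest-neighbour scaling."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [[pixels[y * old_h // new_h][x * old_w // new_w]
             for x in range(new_w)]
            for y in range(new_h)]

# Synthetic 8x6 image in which every pixel value is unique.
big = [[y * 10 + x for x in range(8)] for y in range(6)]
crop = [row[3:7] for row in big[2:5]]   # 4x3 region starting at x=3, y=2

crop_offset = find_crop(crop, big)

shrunk = resize_nearest(big, 4, 3)      # half-size copy
restored = resize_nearest(shrunk, 8, 6) # scaled back up for comparison
resize_offset = find_crop(restored, big, tolerance=11)
```

The scaled-back copy no longer matches pixel for pixel, which is why the second lookup needs a tolerance; with `tolerance=0` it finds nothing. The nested loops also make the cost visible: every candidate offset in the larger image is checked.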

OK, so it's not going to be perfect, but if it cuts down on even 10 or 20% of duplicates, it's a start, and a smaller problem than what you have now. I've tested out number 1 on a few images and found it to be reasonably successful at what it does.

To be honest, none of these methods is going to search 15,000 images very fast. All I can suggest is that you let users upload whatever pictures they choose, but use a function like this to 'prune' your files yourself and tidy things up a bit.
