finding duplicate images
Posted 02 April 2006 - 01:49 PM
First: I have a SHA checksum on all the images, so it's not a problem to find an exact duplicate, but the problem arises when an image has been scaled, saved an extra time (JPEG), or something like that.
After searching for a solution I found out that ImageMagick has a "compare" function, but running compare on 15000+ images every time a user uploads a new image is not an option :/
Another method I was thinking about was taking 20-30 "test pixels" from each image, saving their colors, and trying to match them against every new image. However, this would only work if the images have the same size.
The last solution I have been thinking about is calculating some sort of "average color" of a picture, but I fear that it wouldn't be very reliable and would either return far too many matches or only exact copies.
How would you solve this problem?
Isn't there some kind of standard solution?
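One way to combine the "test pixels" and "average color" ideas so they survive rescaling is to average the color inside each cell of a fixed grid instead of sampling fixed pixel positions. The sketch below is only an illustration, not a tested solution for real JPEGs: it assumes the image has already been decoded into rows of (r, g, b) tuples, and the names `grid_signature` and `signature_distance` as well as the 4x4 grid size are made up for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (r, g, b) tuples, assumed already decoded


def grid_signature(img: Image, grid: int = 4) -> List[Pixel]:
    """Return the average color of each cell in a grid x grid partition."""
    height, width = len(img), len(img[0])
    signature = []
    for gy in range(grid):
        y0 = gy * height // grid
        y1 = max((gy + 1) * height // grid, y0 + 1)  # at least one row
        for gx in range(grid):
            x0 = gx * width // grid
            x1 = max((gx + 1) * width // grid, x0 + 1)  # at least one column
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            count = len(cell)
            signature.append(
                tuple(sum(p[c] for p in cell) // count for c in range(3))
            )
    return signature


def signature_distance(a: List[Pixel], b: List[Pixel]) -> float:
    """Mean absolute per-channel difference between two signatures."""
    total = sum(
        abs(ca - cb) for pa, pb in zip(a, b) for ca, cb in zip(pa, pb)
    )
    return total / (len(a) * 3)
```

The signature has the same length whatever the image size, so it can be stored once per image and compared against a new upload with a threshold on `signature_distance`. The threshold is something you would have to tune on your own images; cropped images will still slip through.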
Posted 02 April 2006 - 03:25 PM
So instead of trying to make sure you catch ALL duplicates, just aim to find some, or a lot. For this you could use the methods you described yourself.
And maybe you should also take file size into account.
Posted 02 April 2006 - 04:48 PM
Use a separate table: |imageID|cksum value of image|cksum value of thumb|
Put an index on the cksum values.
When a new image is uploaded, search the table for matches?
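The table-and-index idea could look roughly like the sketch below. It uses SQLite and SHA-1 purely as assumptions for the example; the table name `image_sums`, the column names, and the helper functions are hypothetical, and the same layout would work in any database.

```python
import hashlib
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the checksum table and its indexes if they do not exist."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS image_sums (
               image_id  INTEGER PRIMARY KEY,
               image_sum TEXT NOT NULL,
               thumb_sum TEXT NOT NULL)"""
    )
    db.execute("CREATE INDEX IF NOT EXISTS idx_image_sum ON image_sums(image_sum)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_thumb_sum ON image_sums(thumb_sum)")
    return db


def checksum(data: bytes) -> str:
    """Hex SHA-1 digest of the raw file bytes."""
    return hashlib.sha1(data).hexdigest()


def find_matches(db: sqlite3.Connection, image: bytes, thumb: bytes) -> list:
    """Return ids of stored images whose image or thumb checksum matches."""
    rows = db.execute(
        "SELECT image_id FROM image_sums "
        "WHERE image_sum = ? OR thumb_sum = ? ORDER BY image_id",
        (checksum(image), checksum(thumb)),
    )
    return [row[0] for row in rows]


def add_image(db: sqlite3.Connection, image: bytes, thumb: bytes) -> int:
    """Store the checksums of a new upload and return its id."""
    cur = db.execute(
        "INSERT INTO image_sums (image_sum, thumb_sum) VALUES (?, ?)",
        (checksum(image), checksum(thumb)),
    )
    db.commit()
    return cur.lastrowid
```

On upload you would call `find_matches` first and only `add_image` when nothing comes back. Keep in mind that a byte checksum of the thumbnail only matches when the thumbnail bytes are identical, so this mainly catches exact copies.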
Posted 02 April 2006 - 05:18 PM
1. The image has been cropped compared to the existing image, or it is the larger, uncropped version of the existing image. Even though it's a lengthy process, I thought of taking a few 'sample' lines from the smaller of the two images using 'imagecolorat' (GD library), then going through each line of the larger image (also with imagecolorat) looking for a match.
2. The image has been resized. Scale the smaller picture, either plain or resampled (using the GD library again), so it's the same size as the larger picture, then run a similar check as in point 1 above.
OK, so it's not going to be perfect, but if it cuts down on even 10 or 20% of duplicates, it's a start, and a smaller problem than what you have now. I've tested out number 1 on a few images and found it to be reasonably successful at what it does.
To be honest, none of these methods is going to search 15,000 images very fast. All I can suggest is that you let users upload whatever picture they choose, but use a function like this to 'prune' your files yourself and tidy things up a bit.
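For anyone who wants to try points 1 and 2, here is a rough sketch of the idea. It is not the code I tested: it is written in Python rather than PHP/GD, it assumes both images are already decoded into rows of (r, g, b) tuples (what you would collect with imagecolorat), and the names `line_matches`, `find_crop` and `resize_nearest` plus the tolerance value are invented for the example.

```python
def line_matches(sample, row, x_off, tol):
    """True if every pixel of sample is within tol of row starting at x_off."""
    window = row[x_off:x_off + len(sample)]
    return all(
        abs(a - b) <= tol for p, q in zip(sample, window) for a, b in zip(p, q)
    )


def find_crop(small, large, tol=8):
    """Return (x, y) where small appears inside large, or None.

    Only a few sample lines (top, middle, bottom) of the smaller image
    are compared, so a hit is a likely duplicate, not a proven one.
    """
    sh, sw = len(small), len(small[0])
    lh, lw = len(large), len(large[0])
    if sh > lh or sw > lw:
        return None
    sample_ys = sorted({0, sh // 2, sh - 1})
    for y in range(lh - sh + 1):
        for x in range(lw - sw + 1):
            if all(
                line_matches(small[sy], large[y + sy], x, tol)
                for sy in sample_ys
            ):
                return (x, y)
    return None


def resize_nearest(img, new_w, new_h):
    """Nearest-neighbor resize, standing in for GD's plain (non-resampled) copy."""
    h, w = len(img), len(img[0])
    return [
        [img[y * h // new_h][x * w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]
```

For point 2 you would call `resize_nearest` on the smaller image with the larger image's width and height and then run `find_crop` on the result, where a match should come back as (0, 0). The tolerance is there because a re-saved JPEG rarely has identical pixel values; what value works is something you'd have to find by trial.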