
finding duplicate images


Marl


I have a website with about 15,000 images, and get about 20-30 new ones every day. However, duplicates are becoming a problem, and right now I'm looking for a way to alert the user who is uploading that his/her image may already be in the database.

First, I have a SHA checksum of every image, so finding an exact duplicate is not a problem. The problem arises when the image has been scaled, re-saved (JPEG), or something like that.
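
A minimal sketch of that checksum check, and of why it stops working once a file is re-saved. The post only says "SHA", so SHA-256 is an assumption here, and the byte strings are made-up stand-ins for real image files:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

# Made-up stand-ins for file contents, not real JPEG data.
original = b"\xff\xd8\xff\xe0 fake jpeg payload"
exact_copy = bytes(original)
resaved = original + b"\x00"  # any re-encode changes at least some bytes

print(checksum(original) == checksum(exact_copy))  # True
print(checksum(original) == checksum(resaved))     # False
```

A single changed byte gives a completely different digest, so a checksum can only ever catch byte-identical copies.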

After searching for a solution I found out that ImageMagick has a "compare" function, but running compare against 15,000+ images every time a user uploads a new image is not an option :/

Another method I was thinking about was taking 20-30 "test pixels" from each image, saving their colors, and trying to match them against every new image. However, this would only work if the images have the same size.
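
One possible way around the size restriction is to pick the test pixels at relative positions (fractions of the width and height) instead of fixed coordinates. The sketch below is only an illustration under assumptions: images are taken to be already decoded into nested lists of grayscale values, and the names `sample_points` and `samples_match` are hypothetical.

```python
def sample_points(pixels, grid=5):
    """Return grid*grid test pixels taken at relative positions."""
    height, width = len(pixels), len(pixels[0])
    samples = []
    for gy in range(grid):
        for gx in range(grid):
            # Centre of each grid cell, scaled to this image's own size.
            y = int((gy + 0.5) * height / grid)
            x = int((gx + 0.5) * width / grid)
            samples.append(pixels[y][x])
    return samples

def samples_match(a, b, tolerance=10):
    """True if every pair of test pixels differs by at most `tolerance`."""
    return all(abs(p - q) <= tolerance for p, q in zip(a, b))

# A 10x10 gradient, the same picture pixel-doubled to 20x20, and its negative.
small = [[(x + y) * 10 for x in range(10)] for y in range(10)]
doubled = [[small[y // 2][x // 2] for x in range(20)] for y in range(20)]
negative = [[255 - v for v in row] for row in small]

print(samples_match(sample_points(small), sample_points(doubled)))   # True
print(samples_match(sample_points(small), sample_points(negative)))  # False
```

With `grid=5` this stores 25 values per image, in line with the 20-30 test pixels mentioned above. Real rescaling blurs pixels, so the tolerance would need tuning on actual data.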

The last solution I have been thinking about is calculating some sort of "average color" for each picture, but I fear that it wouldn't be very reliable and would either return far too many matches or only exact copies.
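
The worry about too many matches is easy to reproduce. In this small sketch (grayscale values in nested lists are an assumption, and `average_value` is a hypothetical helper) two clearly different pictures end up with the same average:

```python
def average_value(pixels):
    """Mean of all pixel values in the image."""
    total = sum(sum(row) for row in pixels)
    return total / (len(pixels) * len(pixels[0]))

checker = [[0, 255], [255, 0]]   # checkerboard
stripes = [[0, 0], [255, 255]]   # horizontal stripes

print(average_value(checker), average_value(stripes))  # 127.5 127.5
```

A single average throws away all layout information, so it could at best serve as a cheap first filter before a more careful comparison.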

How would you solve this problem?
Isn't there some kind of standard solution?

Sunday morning, daylight savings time, lost another hour of sleep, just took my meds, not happy with the bald spot that keeps growing on the back of my head (but at least it's not being replaced by ones on my back or in my ears). Anyway, how about...

a separate table: | imageID | cksum value of image | cksum value of thumb |
with an index on the cksum values

when a new image is uploaded, search the table for matches?
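
A rough sketch of that table using SQLite from Python. The table name, column names, and checksum strings are all made up for illustration, and the thread does not say which database the site actually uses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE image_checksums ("
    " image_id INTEGER PRIMARY KEY,"
    " image_cksum TEXT NOT NULL,"
    " thumb_cksum TEXT NOT NULL)"
)
# Index both checksum columns so lookups stay fast as the table grows.
conn.execute("CREATE INDEX idx_image_cksum ON image_checksums (image_cksum)")
conn.execute("CREATE INDEX idx_thumb_cksum ON image_checksums (thumb_cksum)")

def add_image(image_id, image_cksum, thumb_cksum):
    conn.execute(
        "INSERT INTO image_checksums VALUES (?, ?, ?)",
        (image_id, image_cksum, thumb_cksum),
    )

def find_matches(image_cksum, thumb_cksum):
    """IDs of stored images whose image or thumbnail checksum matches."""
    rows = conn.execute(
        "SELECT image_id FROM image_checksums"
        " WHERE image_cksum = ? OR thumb_cksum = ? ORDER BY image_id",
        (image_cksum, thumb_cksum),
    ).fetchall()
    return [row[0] for row in rows]

add_image(1, "aaa", "t-aaa")
add_image(2, "bbb", "t-bbb")
print(find_matches("aaa", "t-new"))  # [1]
print(find_matches("new", "t-bbb"))  # [2]
print(find_matches("new", "t-new"))  # []
```

Whether the thumbnail checksum catches scaled copies depends on the thumbnails coming out byte-identical, which is not guaranteed once an image has been re-encoded.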

Lite...

I've been looking at ways of doing a similar thing, and there were two cases where an image match would cause issues, with some solutions I had for both:

1. The image has been cropped compared to the existing image, or the image is the larger, uncropped version of the existing image. My thought, even though it's a lengthy process, was to take a few 'sample' lines from the smaller of the two images using 'imagecolorat' (GD library), then go through each line of the larger image (also with imagecolorat) looking for a match.

2. The image has been resized. Scale the smaller picture, either plain or resampled (using the GD library again), so it's the same size as the larger picture, then run a similar check as in point 1 above.
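
The two checks above could be sketched roughly as follows. This is not the poster's GD code: it is written in Python, assumes the pixels have already been read into nested lists of grayscale values (the role 'imagecolorat' plays in GD), and all function names are hypothetical.

```python
def row_matches(sample, row, x, tolerance):
    """True if `sample` matches `row` starting at column x, within tolerance."""
    return all(abs(row[x + i] - v) <= tolerance for i, v in enumerate(sample))

def find_crop_offset(small, large, tolerance=0):
    """Point 1: look for a few sample lines of `small` inside `large`.
    Returns the (x, y) offset of the first match, or None."""
    sh, sw = len(small), len(small[0])
    lh, lw = len(large), len(large[0])
    if sh > lh or sw > lw:
        return None
    sample_ys = sorted({0, sh // 2, sh - 1})  # top, middle, bottom lines
    for y in range(lh - sh + 1):
        for x in range(lw - sw + 1):
            if all(row_matches(small[sy], large[y + sy], x, tolerance)
                   for sy in sample_ys):
                return (x, y)
    return None

def scale_nearest(pixels, new_w, new_h):
    """Plain (non-resampled) nearest-neighbour scaling."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [[pixels[y * old_h // new_h][x * old_w // new_w]
             for x in range(new_w)] for y in range(new_h)]

def looks_resized(small, large, tolerance=8):
    """Point 2: scale `small` to the size of `large`, then reuse the point 1 check."""
    scaled = scale_nearest(small, len(large[0]), len(large))
    return find_crop_offset(scaled, large, tolerance) == (0, 0)

# Point 1: a 4x3 crop taken from a 6x6 image at offset (1, 2).
big = [[y * 10 + x for x in range(6)] for y in range(6)]
crop = [row[1:5] for row in big[2:5]]
print(find_crop_offset(crop, big))  # (1, 2)

# Point 2: a 2x2 image and a 4x4 enlargement with slight re-save noise.
tiny = [[10, 200], [60, 120]]
noisy = [[12, 10, 198, 200], [10, 11, 200, 201],
         [60, 62, 120, 119], [61, 60, 121, 120]]
print(looks_resized(tiny, noisy))  # True
```

The crop search is a brute-force scan over every offset, so it is slow on full-size images, which matches the "lengthy process" caveat above. The tolerance of 8 is an arbitrary starting value, not a tested one.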

OK, so it's not going to be perfect, but if it cuts down on even 10 or 20% of duplicates, it's a start, and a smaller problem than what you have now. I've tested out number 1 on a few images and found it to be reasonably successful at what it does.

To be honest, none of these methods is going to search 15,000 images very fast. All I can suggest is that you let the user upload whatever picture they choose, but use a function like this to 'prune' your files yourself to tidy things up a bit.
