

finding duplicate images

3 replies to this topic

#1 Marl

  • New Members
  • Newbie
  • 1 posts

Posted 02 April 2006 - 01:49 PM

I have a website with about 15,000 images, and get about 20-30 new ones every day. However, duplicates are becoming a problem, and right now I'm looking for a method to alert users who are uploading that their image may already be in the database.

First, I have a SHA checksum for every image, so finding an exact duplicate is not a problem. The problem arises when an image has been scaled, re-saved (JPG), or something like that.

After searching for a solution I found out that ImageMagick has a "compare" function, but running compare against 15,000+ images every time a user uploads a new image is not an option :/

Another method I was thinking about was taking 20-30 "test pixels" from each image, saving their colors, and trying to match them against every new image. However, this would only work if the images have the same size.
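
One possible way around the size limitation, as a rough sketch only: take the test pixels at relative positions (fractions of the width and height) instead of fixed coordinates. The helper names `sample_points` and `sample_distance` are made up for illustration, and the images are assumed to be already decoded into rows of (r, g, b) tuples; the decoding itself would need an image library and is not shown.

```python
# Hypothetical helpers, not part of any library.
# An "image" here is a list of rows, each row a list of (r, g, b) tuples.

def sample_points(image, grid=5):
    """Return grid * grid pixels taken at fixed fractional positions."""
    height = len(image)
    width = len(image[0])
    samples = []
    for gy in range(grid):
        for gx in range(grid):
            # Centre of each grid cell, expressed as a fraction of the size,
            # so the same spots are sampled regardless of resolution.
            y = min(height - 1, int((gy + 0.5) * height / grid))
            x = min(width - 1, int((gx + 0.5) * width / grid))
            samples.append(image[y][x])
    return samples


def sample_distance(a, b):
    """Mean absolute per-channel difference between two sample lists."""
    total = 0
    for pixel_a, pixel_b in zip(a, b):
        total += sum(abs(ca - cb) for ca, cb in zip(pixel_a, pixel_b))
    return total / (len(a) * 3)
```

The samples could be stored once per image at upload time, and a new upload flagged when its distance to a stored set falls below some threshold. That threshold would have to be tuned on real images.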

The last solution I have been thinking about is calculating some sort of "average color" of a picture, but I fear that it wouldn't be very reliable and would either return far too many matches, or only exact copies.
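
A single average color is indeed very coarse. One variation, sketched here under assumptions, is to average the brightness of each cell in a small grid and keep one bit per cell (brighter or darker than the mean of all cells). This is similar in spirit to what is often called an "average hash". The names `block_means`, `signature` and `hamming` are hypothetical, and the images are again assumed to be rows of (r, g, b) tuples at least `grid` pixels wide and high.

```python
# Hypothetical helpers; an "image" is a list of rows of (r, g, b) tuples.
# Assumes the image is at least grid x grid pixels, so no cell is empty.

def block_means(image, grid=4):
    """Average brightness of each cell in a grid x grid partition."""
    height, width = len(image), len(image[0])
    means = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [sum(image[y][x]) / 3
                    for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(cell) / len(cell))
    return means


def signature(image, grid=4):
    """One bit per cell: 1 if the cell is brighter than the mean of all cells."""
    means = block_means(image, grid)
    overall = sum(means) / len(means)
    return [1 if m > overall else 0 for m in means]


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))
```

Two images whose signatures differ in only a few bits would be candidates for a closer look. How many differing bits to tolerate is a tuning question this sketch does not answer.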

How would you solve this problem?
Isn't there some kind of standard solution?

#2 Desdinova

  • Members
  • Advanced Member
  • 41 posts

Posted 02 April 2006 - 03:25 PM

Well, for starters you could of course limit your needs.

So instead of wanting to make sure you catch ALL duplicates, just find some, or a lot. For this you could use the methods you described yourself.

And maybe you should also take file size into account.

#3 litebearer

  • Members
  • Advanced Member
  • 2,357 posts
  • Location: white lake michigan

Posted 02 April 2006 - 04:48 PM

Sunday morning, daylight savings time, lost another hour of sleep, just took my meds, not happy with the bald spot that keeps growing on the back of my head (but at least it's not being replaced by ones on my back or in my ears), anyway how about...

a separate table: | imageID | cksum value of image | cksum value of thumb |
with an index on the cksum values

when a new image is uploaded, search the table for matches?
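
A minimal sketch of that table idea, assuming SQLite and SHA-1 purely for illustration (the thread names neither a database nor a specific checksum for this table). The table, column and function names are made up. Note that a byte-level checksum of a thumbnail only matches when the generated thumbnail files are byte-identical.

```python
import hashlib
import sqlite3


def make_db():
    """Create an in-memory table of checksums, indexed for fast lookup."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE image_sums ("
        " image_id INTEGER PRIMARY KEY,"
        " image_sum TEXT NOT NULL,"
        " thumb_sum TEXT NOT NULL)"
    )
    db.execute("CREATE INDEX idx_image_sum ON image_sums (image_sum)")
    db.execute("CREATE INDEX idx_thumb_sum ON image_sums (thumb_sum)")
    return db


def checksum(data):
    """Hex SHA-1 digest of raw bytes (the algorithm is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def add_image(db, image_bytes, thumb_bytes):
    """Store both checksums for a new image and return its id."""
    cur = db.execute(
        "INSERT INTO image_sums (image_sum, thumb_sum) VALUES (?, ?)",
        (checksum(image_bytes), checksum(thumb_bytes)),
    )
    return cur.lastrowid


def find_matches(db, image_bytes, thumb_bytes):
    """Return ids of stored images sharing either checksum."""
    rows = db.execute(
        "SELECT image_id FROM image_sums"
        " WHERE image_sum = ? OR thumb_sum = ?",
        (checksum(image_bytes), checksum(thumb_bytes)),
    )
    return [row[0] for row in rows]
```

On upload, `find_matches` would be called before `add_image`, and any returned ids shown to the user as possible duplicates.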



#4 redbullmarky

  • Staff Alumni
  • Advanced Member
  • 2,863 posts
  • Location: Bedfordshire, England

Posted 02 April 2006 - 05:18 PM

I've been looking at ways of doing a similar thing. There were two cases where matching images would cause issues, and I had a possible solution for each:

1, the image has been cropped compared to the existing image, or it is the larger, uncropped version of the existing image - even though it's a lengthy process, I thought of taking a few 'sample' lines from the smaller of the two images using 'imagecolorat' (GD library), then going through each line of the larger image (also with imagecolorat) looking for a match.

2, the image has been resized. Scale the smaller picture, either normally or resampled (using the GD library again), so it's the same size as the larger picture, then run a similar check as in point 1 above.
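
The sample-line check from point 1 could look roughly like this. It is a sketch in Python rather than PHP/GD, with pixel lookups on plain lists standing in for imagecolorat; the names `row_close` and `find_crop_offset`, the tolerance value and the number of sample lines are all assumptions. For point 2, the smaller image would first be scaled up and then passed through the same check.

```python
# Hypothetical helpers; an "image" is a list of rows of (r, g, b) tuples.

def row_close(a, b, tolerance):
    """True if every channel of every pixel pair differs by <= tolerance."""
    return all(abs(ca - cb) <= tolerance
               for pixel_a, pixel_b in zip(a, b)
               for ca, cb in zip(pixel_a, pixel_b))


def find_crop_offset(small, large, tolerance=8, samples=3):
    """Return (x, y) where small appears inside large, or None."""
    sh, sw = len(small), len(small[0])
    lh, lw = len(large), len(large[0])
    if sh > lh or sw > lw:
        return None
    # A few evenly spaced rows of the small image act as the sample lines.
    rows = sorted({i * (sh - 1) // max(1, samples - 1)
                   for i in range(samples)})
    # Try every position of the small image inside the large one.
    for oy in range(lh - sh + 1):
        for ox in range(lw - sw + 1):
            if all(row_close(small[r], large[oy + r][ox:ox + sw], tolerance)
                   for r in rows):
                return (ox, oy)
    return None
```

The tolerance is there because a re-saved JPEG rarely reproduces pixel values exactly. The brute-force search over every offset is slow on large images, which fits the point about running this as an offline clean-up rather than at upload time.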

OK, so it's not going to be perfect, but if it cuts down on even 10 or 20% of duplicates, it's a start, and a smaller problem than what you have now. I've tested out number 1 on a few images and found it to be reasonably successful at what it does.

To be honest, none of these methods is going to search 15,000 images very fast. All I can suggest is that you let users upload whatever pictures they choose, but use a function like this to 'prune' your files yourself to tidy things up a bit.
