
[SOLVED] How can I identify duplicated images?


marcosr


I am developing a small image hosting to add to my personal website.

One of the problems I want to avoid is the amount of disk space that images require.

To avoid this, I decided to check whether the image the user wants to upload is already stored on the server. To do this I save the contents of the image in the DB; I mean, I save the result of file_get_contents($image) after hashing it with md5 (because I think it creates unique values, right?).

Then I want to check whether the file_get_contents of the image that the user is uploading is or isn't already in the DB.

 

Do you think my system will work correctly?

What problems do you think I may have in the future?

Is there a better way of doing what I want?

 

Thanks,

Marcos.


I don't MD5 the image itself. I only read its binary data and MD5 that, so the hash can be saved in a DB field named 'image_content'; the image file stays exactly as the user sent it. The purpose is to keep a fingerprint of each stored image's binary data, so that the image being uploaded can be compared with the ones already stored.
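A minimal sketch of that fingerprinting step, assuming the upload is available as a file on disk. The helper name image_fingerprint and the temporary file are made up for illustration; md5_file() gives the same hash as md5(file_get_contents()) without first loading the whole file into a PHP string.

```php
<?php
// Hypothetical helper: fingerprint a file without modifying it.
function image_fingerprint(string $path): string
{
    $hash = md5_file($path);          // hashes the file's raw bytes
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $hash;                     // 32 hexadecimal characters
}

// Demo: a temporary file stands in for the uploaded image.
$tmp = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($tmp, "fake image bytes");

$fingerprint = image_fingerprint($tmp);
// Same result as the approach described in the post.
$same = ($fingerprint === md5(file_get_contents($tmp)));

unlink($tmp);
```

The value in $fingerprint is what would go into the 'image_content' field.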


I've often wondered why more of the social media sites don't do this.  It's so easy to detect duplicate files (at least IMO).

 

The only way to be sure that two files A and B are duplicates is to check that they have exactly the same sequence of bits.  However, you don't want to compare a newly uploaded file with every other existing file on your site, for obvious reasons.
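For reference, an exact comparison of two files could look roughly like this. It is only a sketch: the function name files_identical and the 8192-byte chunk size are arbitrary choices, and it assumes both paths are regular local files.

```php
<?php
// Hypothetical helper: true only if both files contain the same bytes.
function files_identical(string $a, string $b): bool
{
    // Different sizes can never be the same file; this is the cheap check.
    if (filesize($a) !== filesize($b)) {
        return false;
    }
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    if ($fa === false || $fb === false) {
        throw new RuntimeException('Cannot open files for comparison');
    }
    $same = true;
    // Read both files in chunks and stop at the first difference.
    while (!feof($fa)) {
        if (fread($fa, 8192) !== fread($fb, 8192)) {
            $same = false;
            break;
        }
    }
    fclose($fa);
    fclose($fb);
    return $same;
}

// Demo with three temporary files.
$x = tempnam(sys_get_temp_dir(), 'cmp');
$y = tempnam(sys_get_temp_dir(), 'cmp');
$z = tempnam(sys_get_temp_dir(), 'cmp');
file_put_contents($x, str_repeat('A', 20000));
file_put_contents($y, str_repeat('A', 20000));
file_put_contents($z, str_repeat('A', 19999) . 'B');

$xy = files_identical($x, $y);   // same bytes
$xz = files_identical($x, $z);   // same size, last byte differs

unlink($x);
unlink($y);
unlink($z);
```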

 

What you can do is create a key from each file that is uploaded (an MD5 hash seems appropriate enough) and store that key in a DB along with the rest of the file's information.  Keep in mind that MD5 can have collisions, so it is possible for two different files to generate the same hash, but that's not really a problem: the likelihood of an accidental collision is low, and the hash is only used to narrow down the search.  So every time a new file is uploaded, hash it and look for entries in the DB that have the same hash.  For each matching entry, compare the newly uploaded file with the existing file on disk to determine whether they really are the same file.
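Roughly, the lookup described above could be sketched like this. A plain PHP array stands in for the DB table (hash => list of stored paths), and find_duplicate / store_upload are made-up names; in a real site the array access would be a query on an indexed hash column. The byte comparison uses file_get_contents() for brevity, which is fine for small images.

```php
<?php
// $index stands in for the DB: md5 hash => list of stored file paths.
function find_duplicate(array $index, string $uploadPath): ?string
{
    $hash = md5_file($uploadPath);
    // Only files with the same hash are candidates.
    foreach ($index[$hash] ?? [] as $storedPath) {
        // Confirm byte for byte, in case of a hash collision.
        if (file_get_contents($storedPath) === file_get_contents($uploadPath)) {
            return $storedPath;
        }
    }
    return null;
}

// Returns the path that should be used for this upload.
function store_upload(array &$index, string $uploadPath): string
{
    $existing = find_duplicate($index, $uploadPath);
    if ($existing !== null) {
        return $existing;                        // reuse the stored copy
    }
    $index[md5_file($uploadPath)][] = $uploadPath;
    return $uploadPath;                          // new file, now indexed
}

// Demo: two identical uploads and one different upload.
$a = tempnam(sys_get_temp_dir(), 'up');
$b = tempnam(sys_get_temp_dir(), 'up');
$c = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($a, "cat picture bytes");
file_put_contents($b, "cat picture bytes");
file_put_contents($c, "dog picture bytes");

$index = [];
$first  = store_upload($index, $a);   // stored as new
$second = store_upload($index, $b);   // detected as a duplicate of $a
$third  = store_upload($index, $c);   // stored as new
$hashesStored = count($index);

unlink($a);
unlink($b);
unlink($c);
```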

 

Keep in mind the one place where this can fall apart is if what would be the same file is uploaded in different formats.  For example, the same image uploaded as a gif and a jpg.  Or a jpg is uploaded in different resolutions.  Or a movie file in different formats.  In these cases it is difficult to automatically detect duplicate submissions and as such, you should build a user-oriented feature into the site for users to report duplicates (along with the original item).



...or someone photoshops the image and pushes one pixel to the side, or adds a hidden layer, or... etc. And there's your answer to your first question.
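To illustrate how little it takes, here is a small sketch using a made-up byte string in place of real image data: changing a single byte is enough to make the MD5 hashes differ, so an exact-match check no longer sees the two versions as duplicates.

```php
<?php
// Made-up data standing in for an image's raw bytes.
$original = str_repeat("\x10\x20\x30", 1000);

// "Edit" the image: change exactly one byte.
$edited = $original;
$edited[1500] = "\x31";

// Count how many bytes actually differ.
$differingBytes = 0;
for ($i = 0, $n = strlen($original); $i < $n; $i++) {
    if ($original[$i] !== $edited[$i]) {
        $differingBytes++;
    }
}

$hashOriginal = md5($original);
$hashEdited   = md5($edited);
$hashesMatch  = ($hashOriginal === $hashEdited);

echo "Bytes changed: $differingBytes\n";
echo "Hashes match: " . ($hashesMatch ? 'yes' : 'no') . "\n";
```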


But that's not usually the case IMO.  The more common case would be a piece of media originating at a site, traveling from person to person, and then being re-uploaded by someone else further down the chain.

 

And this would really be the first line of defense for detecting duplicate media.



Eh, I suppose I will agree with that.  I agree that it would stop duplicates coming in by that route, but I just think that since it's so easy to "get around" it, many people don't bother at all. Or perhaps, for whatever reason(s), they just don't care or don't see it as an issue.  I guess first and foremost you'd have to judge each site on an individual basis.

 

 


@CV, I think the reason most people don't bother with it is how cheap storage is.  It comes down to spending a lot of programming time on sophisticated methods of detecting duplicate media, plus the added CPU strain, vs. spending a couple hundred on a TB of storage.
