[SOLVED] How can I identify duplicated images?

marcosr · May 28, 2008

I am developing a small image hosting service to add to my personal website.

One of the problems I want to avoid is the amount of disk space that images require.

To avoid this problem I decided to check whether the image that the user wants to upload is already stored on the server. To do this I save a fingerprint of the image contents in the DB, I mean I save the result of file_get_contents($image) after hashing it with md5 (because I think it creates unique values, right?).

Then I want to check whether the md5 of the file_get_contents of the image that the user is uploading is or isn't already in the DB.
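A minimal sketch of the check described above, assuming PHP 7 or later. The function names and the $knownHashes array (standing in for the values stored in the DB) are hypothetical, used only for illustration. md5_file() is used because it produces the same value as md5(file_get_contents($path)) without first loading the whole file into a string.

```php
<?php
// Hypothetical helper: returns the 32-character hex MD5 of a file's bytes.
function image_fingerprint(string $path): string
{
    $hash = md5_file($path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $hash;
}

// Hypothetical check: $knownHashes stands in for the hashes already saved in the DB.
function is_known_image(string $path, array $knownHashes): bool
{
    return in_array(image_fingerprint($path), $knownHashes, true);
}
```

In an upload handler, $path would typically be the temporary file PHP creates for the upload. Note that a matching hash only means the bytes are very probably identical, not certainly.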

Do you think my system will work correctly?

What problems do you think I may have in the future?

Is there a better way of doing what I want?

Thanks,

Marcos.

DarkWater · May 28, 2008

Don't MD5 the image binary data, lol. Then the image won't work...O_O

marcosr · May 28, 2008

I don't modify the image itself. I only read its binary data and MD5 it, then save the hash in a DB field named 'image_content'; the image stays exactly as the user sent it. The purpose is to be able to compare the data of the image being uploaded against the images that are already stored.

marcosr · May 28, 2008

I would appreciate any kind of help!

roopurt18 · May 28, 2008

I've often wondered why more of the social media sites don't do this. It's so easy to detect duplicate files (at least IMO).

The only method to determine if two files A and B are duplicates is to determine if they have the same exact sequence of bits. However, you don't want to go about trying to compare a newly uploaded file with every other existing file on your site for obvious reasons.

What you can do is create a key from each file that is uploaded (an MD5 hash seems appropriate enough) and store that key in the DB along with the rest of the file's information. Keep in mind that MD5 can have collisions, so it is possible for two different files to generate the same hash, but that's not really a problem: the likelihood is low, and the comparison step at the end catches it. So every time a new file is uploaded, hash it and look for entries in the DB that have the same hash. For each matching entry, compare the newly uploaded file with the existing file on disk to determine whether they are the same file or not.
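The two-step check above (hash lookup first, then a byte comparison of the candidates) could be sketched roughly like this, assuming PHP 7 or later. The function names and the $index array, which stands in for the DB table by mapping an MD5 hash to the stored file paths that share it, are hypothetical.

```php
<?php
// Byte-for-byte comparison, read in chunks so large files are not held in memory.
function files_identical(string $a, string $b): bool
{
    if (filesize($a) !== filesize($b)) {
        return false; // different lengths can never be the same file
    }
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    if ($fa === false || $fb === false) {
        throw new RuntimeException('Cannot open files for comparison');
    }
    try {
        while (!feof($fa)) {
            if (fread($fa, 8192) !== fread($fb, 8192)) {
                return false;
            }
        }
        return true;
    } finally {
        fclose($fa);
        fclose($fb);
    }
}

// Returns the path of an existing identical file, or null if the upload is new.
// $index is a stand-in for the DB: MD5 hash => list of stored file paths.
function find_duplicate(string $uploadPath, array $index): ?string
{
    $hash = md5_file($uploadPath);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $uploadPath");
    }
    foreach ($index[$hash] ?? [] as $storedPath) {
        if (files_identical($uploadPath, $storedPath)) {
            return $storedPath; // same hash and same bytes
        }
    }
    return null; // no match, or only a hash collision
}
```

Because the hash lookup narrows the candidates to a handful at most, the expensive comparison only runs against files that are almost certainly duplicates anyway.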

Keep in mind the one place where this can fall apart is if what would be the same file is uploaded in different formats. For example, the same image uploaded as a gif and a jpg. Or a jpg is uploaded in different resolutions. Or a movie file in different formats. In these cases it is difficult to automatically detect duplicate submissions and as such, you should build a user-oriented feature into the site for users to report duplicates (along with the original item).

.josh · May 28, 2008

Quoting roopurt18:

"I've often wondered why more of the social media sites don't do this. It's so easy to detect duplicate files (at least IMO).

[...]

Keep in mind the one place where this can fall apart is if what would be the same file is uploaded in different formats. For example, the same image uploaded as a gif and a jpg. Or a jpg is uploaded in different resolutions. Or a movie file in different formats. In these cases it is difficult to automatically detect duplicate submissions and as such, you should build a user-oriented feature into the site for users to report duplicates (along with the original item)."

...or someone photoshops it and pushes 1 pixel to the side, or adds a hidden layer, or... etc. And there's your answer to your first question.

roopurt18 · May 29, 2008

But that's not usually the case IMO. The more common case would be a piece of media originating at a site, traveling from person to person, and then being re-uploaded by someone else further down the chain.

And this would really be the first line of defense for detecting duplicate media.

.josh · May 29, 2008

Quoting roopurt18:

"But that's not usually the case IMO. The more common case would be a piece of media originating at a site, traveling from person to person, and then being re-uploaded by someone else further down the chain.

And this would really be the first line of defense for detecting duplicate media."

eh, I suppose I will agree with that. I agree that it would stop duplicates from happening from that venue, but I just think that since it's so easy to "get around" that, many people don't bother at all. Or perhaps for whatever reason(s) they just don't care, or look at it as some kind of issue. I guess first and foremost you'd have to judge the site on an individual basis.

marcosr · May 29, 2008

Thanks, friends!

I have implemented it and it works fine. If I find any issues I will report them here to help anybody who wants to do something similar.

roopurt18 · May 29, 2008

@CV, I think the reason most people don't bother with it is how cheap storage is. It's a lot of programming time to come up with sophisticated methods of detecting duplicate media, plus the added CPU strain, versus spending a couple hundred on a TB of storage.
