marcosr, posted May 28, 2008
I am developing a small image hosting service to add to my personal website. One of the problems I want to avoid is the amount of disk space that images require. To avoid this, I decided to check whether the image that the user wants to upload is already stored on the server. To do this I save a fingerprint of the image in the DB: I take the result of file_get_contents($image) and hash it with md5 (because I think it creates unique values, right?). Then I want to check whether the md5 of the file_get_contents of the image that the user is uploading is already in the DB. Do you think my system will work correctly? What problems do you think I may have in the future? Is there a better way of doing what I want? Thanks, Marcos.
Link: https://forums.phpfreaks.com/topic/107691-solved-how-can-i-identify-duplicated-images/
DarkWater, posted May 28, 2008
Don't MD5 the image binary data, lol. Then the image won't work... O_O
marcosr (Author), posted May 28, 2008
I don't MD5 the binary data inside the image itself. I only read that data and MD5 it so I can save the result in the DB, in a field named 'image_content', and I leave the image exactly as the user sent it. The purpose is to have a stored value that I can compare against the binary data of each image being uploaded.
marcosr (Author), posted May 28, 2008
I would appreciate any kind of help!
roopurt18, posted May 28, 2008
I've often wondered why more of the social media sites don't do this. It's so easy to detect duplicate files (at least IMO).
The only way to be sure that two files A and B are duplicates is to check that they have exactly the same sequence of bits. However, you don't want to compare a newly uploaded file with every other existing file on your site, for obvious reasons. What you can do is create a key from each file that is uploaded (an MD5 hash seems appropriate enough) and store that key in a DB along with the rest of the file's information. Keep in mind that MD5 can have collisions, so it is possible for two different files to generate the same hash, but that's not really a problem as the likelihood is low.
So every time a new file is uploaded, hash it and look for entries in the DB that have the same hash. For each matching entry, compare the newly uploaded file with the existing file on disk to determine whether they really are the same file.
Keep in mind the one place where this can fall apart is when what would be the same file is uploaded in different formats. For example, the same image uploaded as a gif and a jpg, or a jpg uploaded in different resolutions, or a movie file in different formats. In these cases it is difficult to detect duplicate submissions automatically, so you should build a user-oriented feature into the site for users to report duplicates (along with the original item).
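The lookup described in the post above (hash the upload, find stored files with the same hash, then confirm byte for byte) could be sketched in PHP roughly as follows. This is only a minimal sketch under stated assumptions, not what anyone in the thread actually ran: the function names findDuplicate and registerFile are hypothetical, and a plain array called $index stands in for the DB table that maps a hash to stored file paths.

```php
<?php
// Hypothetical helper names; $index is an in-memory stand-in for the DB table
// that maps an MD5 hex digest to the paths of stored files with that digest.

// Returns the path of a stored file identical to $uploadPath, or null if none.
function findDuplicate(string $uploadPath, array $index): ?string
{
    $hash = md5_file($uploadPath);
    if ($hash === false) {
        return null; // upload could not be read; report no duplicate
    }
    foreach ($index[$hash] ?? [] as $candidate) {
        // Equal hashes are only a hint: confirm byte for byte to rule out a collision.
        if (filesize($candidate) === filesize($uploadPath)
            && file_get_contents($candidate) === file_get_contents($uploadPath)) {
            return $candidate;
        }
    }
    return null;
}

// Records a newly stored file under its hash.
function registerFile(string $path, array &$index): void
{
    $hash = md5_file($path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $index[$hash][] = $path;
}

// Usage with throwaway files standing in for uploads.
$index = [];
$first = tempnam(sys_get_temp_dir(), 'img');
$second = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($first, 'same bytes');
file_put_contents($second, 'same bytes');
registerFile($first, $index);
var_dump(findDuplicate($second, $index) === $first); // bool(true)
unlink($first);
unlink($second);
```

In a real site the array would presumably be replaced by an indexed hash column in the DB, and the byte comparison only runs for the few rows that share the hash, so the cost per upload stays small.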
.josh, posted May 28, 2008
Quoting roopurt18: "I've often wondered why more of the social media sites don't do this. It's so easy to detect duplicate files (at least IMO). . . . Keep in mind the one place where this can fall apart is if what would be the same file is uploaded in different formats. For example, the same image uploaded as a gif and a jpg. Or a jpg is uploaded in different resolutions. Or a movie file in different formats. In these cases it is difficult to automatically detect duplicate submissions and as such, you should build a user-oriented feature into the site for users to report duplicates (along with the original item)."
...or someone photoshops the image and pushes one pixel to the side, or adds a hidden layer, or... etc. And there's your answer to your first question.
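The point about tiny edits can be seen with a short PHP sketch. The bytes below are made up and merely stand in for an image file's contents; the only claim illustrated is that changing a single byte leaves the size unchanged but gives a different MD5, so an exact-hash check will not treat the edited copy as a duplicate.

```php
<?php
// Made-up bytes standing in for an image file's contents.
$original = str_repeat("\x10\x20\x30", 1000);

// Copy the data and change a single byte, as a tiny edit to the image would.
$edited = $original;
$edited[1500] = "\x31";

var_dump(strlen($original) === strlen($edited)); // bool(true): same size
var_dump(md5($original) === md5($edited));       // bool(false): different hash
```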
roopurt18, posted May 29, 2008
But that's not usually the case IMO. The more common case would be a piece of media originating at a site, traveling from person to person, and then being re-uploaded by someone else further down the chain. And this would really be the first line of defense for detecting duplicate media.
.josh, posted May 29, 2008
Quoting roopurt18: "But that's not usually the case IMO. The more common case would be a piece of media originating at a site, traveling from person to person, and then being re-uploaded by someone else further down the chain. And this would really be the first line of sense for detecting duplicate media."
Eh, I suppose I will agree with that. I agree that it would stop duplicates coming in from that avenue, but I just think that since it's so easy to "get around" it, many people don't bother at all. Or perhaps for whatever reason(s) they just don't care, or don't see it as an issue. I guess first and foremost you'd have to judge each site on an individual basis.
marcosr (Author), posted May 29, 2008
Thanks, friends! I have implemented it and it works fine. If I find any issues I will report them here to help anybody who wants to do something similar.
roopurt18, posted May 29, 2008
@CV, I think the reason most people don't bother with it is how cheap storage is. It comes down to spending a lot of programming time on sophisticated methods of detecting duplicate media, plus the added CPU strain, versus spending a couple hundred on a TB of storage.