
Comparing whether files are identical



When a file is uploaded, I wish to determine whether an existing identical file (regardless of what it is named or what extension it has) already exists and if so don't save the newly uploaded file but just point it to the existing one.  My primary reason for doing so is not really to save hard drive space but to be able to manage duplicate documents.

One option is to store a hash of the documents in the DB and when a new document is uploaded, query the DB to see if an identical hash exists.  How certain can one be that two hashes will never be duplicated?

Another option is to compare the byte stream.  For this, I would first query the DB to see if a file with same size and media type exists, and then compare the document to all existing files which have that same criteria.

Or maybe some other approach?

Recommendations?

Thanks


 

<?php

error_reporting(E_ALL);
ini_set('display_startup_errors', 1);
ini_set('display_errors', 1);

function compareFilesTest(string $file_a, string $file_b)
{
    $time = microtime(true);
    display($time, $file_a, $file_b, 'compareFilesTest', compareFiles($file_a, $file_b));
}
function compareFilesTest2(string $file_a, string $file_b)
{
    $time = microtime(true);
    display($time, $file_a, $file_b, 'compareFilesTest2', identical($file_a, $file_b));
}

function compareFilesHashTest(string $file_a, string $file_b, string $algo = 'md5')
{
    $time = microtime(true);
    display($time, $file_a, $file_b, 'compareFilesHashTest', hash_file($algo, $file_a)===hash_file($algo, $file_b), $algo);
}

function display(float $time, string $file_a, string $file_b, string $test, bool $status, ?string $algorithm = null)
{
    printf("\n%s\nFile1: %s\nFile2: %s\nStatus: %s\nTime (uS): %d\n", $test.($algorithm?" ($algorithm)":''), $file_a, $file_b, $status?'EQUAL':'NOT EQUAL', 1000000*(microtime(true) - $time));
}

function compareFiles(string $file_a, string $file_b):bool
{
    // Files of different sizes cannot be identical.
    if (filesize($file_a) == filesize($file_b))
    {
        $fp_a = fopen($file_a, 'rb');
        $fp_b = fopen($file_b, 'rb');

        // Compare the files in 4 KB chunks and stop at the first difference.
        while (!feof($fp_a) && ($b = fread($fp_a, 4096)) !== false) {
            $b_b = fread($fp_b, 4096);
            if ($b !== $b_b)
            {
                fclose($fp_a);
                fclose($fp_b);
                return false;
            }
        }

        fclose($fp_a);
        fclose($fp_b);

        return true;
    }

    return false;
}

function identical(string $fileOne, string $fileTwo): bool
{
    // Quick checks first: file type and size must match before reading contents.
    if (filetype($fileOne) !== filetype($fileTwo)) return false;
    if (filesize($fileOne) !== filesize($fileTwo)) return false;

    if (! $fp1 = fopen($fileOne, 'rb')) return false;

    if (! $fp2 = fopen($fileTwo, 'rb'))
    {
        fclose($fp1);
        return false;
    }

    $same = true;

    // Compare in 4 KB chunks; stop at the first mismatch.
    while (! feof($fp1) and ! feof($fp2))
    {
        if (fread($fp1, 4096) !== fread($fp2, 4096))
        {
            $same = false;
            break;
        }
    }

    // If one file reached EOF before the other, they differ.
    if (feof($fp1) !== feof($fp2)) $same = false;

    fclose($fp1);
    fclose($fp2);

    return $same;
}

$path = __DIR__.'/test_documents/';

//print_r(hash_algos());

compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpg');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c1c96440_MasterFormat-2016.pdf');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'x602a9c07af00c_IMG_1225.jpg');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpgx');
compareFilesHashTest($path.'file1.txt', $path.'file2.md');
compareFilesHashTest($path.'file1.txt', $path.'file3.txt');

compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpg', 'sha256');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c1c96440_MasterFormat-2016.pdf', 'sha256');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'x602a9c07af00c_IMG_1225.jpg', 'sha256');
compareFilesHashTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpgx', 'sha256');
compareFilesHashTest($path.'file1.txt', $path.'file2.md', 'sha256');
compareFilesHashTest($path.'file1.txt', $path.'file3.txt', 'sha256');

compareFilesTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpg');
compareFilesTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c1c96440_MasterFormat-2016.pdf');
compareFilesTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'x602a9c07af00c_IMG_1225.jpg');
compareFilesTest($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpgx');
compareFilesTest($path.'file1.txt', $path.'file2.md');
compareFilesTest($path.'file1.txt', $path.'file3.txt');

compareFilesTest2($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpg');
compareFilesTest2($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c1c96440_MasterFormat-2016.pdf');
compareFilesTest2($path.'602a9c07af00c_IMG_1225.jpg', $path.'x602a9c07af00c_IMG_1225.jpg');
compareFilesTest2($path.'602a9c07af00c_IMG_1225.jpg', $path.'602a9c07af00c_IMG_1225.jpgx');
compareFilesTest2($path.'file1.txt', $path.'file2.md');
compareFilesTest2($path.'file1.txt', $path.'file3.txt');

 


It is possible to have an MD5 collision, but the odds are remote, something like 1 in 2^64. It is effectively impossible for two different files to also have the same SHA-1 sum. If you get a matching MD5 hash, then calculate the SHA-1 sum to see if those match as well. Then you would be safe rejecting the file if it is that critical.
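For illustration, a minimal sketch of that two-step check in PHP (the file paths and the stored MD5 value are placeholders, not from this thread):

<?php
// Minimal sketch of the MD5-then-SHA-1 double check described above.
// The paths and the stored MD5 value are placeholders.
$existingFile = '/path/to/existing/file.pdf';
$uploadedFile = '/path/to/uploaded/tmpfile';
$storedMd5    = 'd41d8cd98f00b204e9800998ecf8427e'; // MD5 saved when the existing file was stored

if (hash_file('md5', $uploadedFile) === $storedMd5) {
    // MD5 matches; confirm with SHA-1 before treating the upload as a duplicate.
    if (hash_file('sha1', $uploadedFile) === hash_file('sha1', $existingFile)) {
        echo "Duplicate of the existing file\n";
    }
}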


Thanks gw1500se.  Looks like I can go with MD5 and also invest $1 in the lottery as insurance.  Why do you recommend going with MD5 and not SHA-1 (or even SHA-256) from the get-go?  I didn't say so, but the frequency of uploading files will be fairly low.

Thanks requinix.  For my application, duplicates will be very likely, as users will often upload the same document for multiple opportunities.  Would you still recommend also comparing the size (the query will likely include the hash, size, and mime type) and the content?  Surprisingly to me, it takes about 5 times longer to calculate the two files' hashes (granted, I will only need to calculate one because the other is in the DB) than to compare the content using the two functions I showed, but that seems like a small performance hit for some peace of mind, and I have no issue doing so.


24 minutes ago, NotionCommotion said:

Thanks requinix.  For my application, duplicates will be very likely, as users will often upload the same document for multiple opportunities.  Would you still recommend also comparing the size (the query will likely include the hash, size, and mime type) and the content?  Surprisingly to me, it takes about 5 times longer to calculate the two files' hashes (granted, I will only need to calculate one because the other is in the DB) than to compare the content using the two functions I showed, but that seems like a small performance hit for some peace of mind, and I have no issue doing so.

There are some caching nuances to consider, some from PHP and some from the operating system, but there is one thing you forgot with the hash test:

The hash for the first file is only going to be calculated once.

The compare-by-hash test would be more accurate if you gave it a hash string for the first file, which should reduce the execution time by about half.

function compareFilesHashTest(string $file_a, string $file_b, string $algo = 'md5')
{
    // Hash the first file before starting the timer, to simulate a hash
    // that is already stored in the database.
    $hash_a = hash_file($algo, $file_a);
    $time = microtime(true);
    display($time, $file_a, $file_b, 'compareFilesHashTest', $hash_a===hash_file($algo, $file_b), $algo);
}

 

>99.9% of the time, the hash will catch the duplicate. For the remainder, the file size is a very easy hurdle to clear before you move on to the longer process of reading the contents from both files.
As for the hashing algorithm, you don't need security here. What you need is an algorithm that is fast and produces enough entropy that hashes are the least likely to collide. MD5 is faster than SHA-1, and while it produces 128 bits compared to SHA-1's 160, the odds are still that you'd need on the order of 2^64 documents (a 20-digit number) before you hit a collision.
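To make that lookup flow concrete, here is a rough sketch, assuming a PDO connection in $pdo and a hypothetical physical_file table with hash, size, and file_storage_location columns; it reuses the compareFiles() function posted earlier in the thread:

<?php
// Rough sketch: return the storage location of an existing identical file, or null.
// Assumes a PDO connection and a hypothetical physical_file(hash, size, file_storage_location) table.
function findExistingFile(PDO $pdo, string $uploadedFile): ?string
{
    $hash = md5_file($uploadedFile);
    $size = filesize($uploadedFile);

    // Cheap prefilter: only rows with the same hash and size are candidates.
    $stmt = $pdo->prepare('SELECT file_storage_location FROM physical_file WHERE hash = ? AND size = ?');
    $stmt->execute([$hash, $size]);

    // For the rare candidates, confirm by comparing the actual contents
    // (compareFiles() is the chunked byte-compare posted earlier in the thread).
    foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $existing) {
        if (compareFiles($existing, $uploadedFile)) {
            return $existing;
        }
    }
    return null;
}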


Oh, dammit, I'm thinking about this all backwards.

The odds of a collision are mostly irrelevant because matching hashes are unlikely to happen unless the files are identical. And identical files are going to happen.

But MD5 is still fine because the odds are still astronomical against the hashes of two different files matching. All it does is increase the average number of files you'll have to manually compare by a negligible amount.


I would compare the size, then the hash. If the size/hash for existing files is in a DB table, you can just query to find any potential matches, then compare the contents only for the files with a size/hash match.

Some code I did long ago, with some new comments related to including a newly uploaded file in the data -

<?php
// dupless (a file compare utility) using php

// read all the files and file size, store using the size as main array key, path/file as the data

// all sizes with only one file are unique - remove

// loop through each size array and get the md5 of the file, store using the md5 as a key, path/file as the data

// all sizes/md5 with only one file are unique - remove

// remaining with same size/md5 are likely the same, you would need to actually compare contents at this point

$path = './keep/'; // get list of existing files

$files = glob($path.'*.*');
// because the $files data contains the path/file, you can array_merge() other files here to
// include them in the comparison, such as a newly uploaded file. if the path/file is in the results,
// there's an existing file with the same size/hash as the uploaded file
$files = array_merge($files, glob('*.*'));

$data = array();
// get data by size
foreach($files as $file){
	$data[filesize($file)][] = $file;
}

foreach($data as $size=>$arr){
	// remove unique size
	if(count($arr)==1){
		unset($data[$size]);
	} else {
		// more than one for this size
		foreach($arr as $pos=>$file){
			// remove existing element
			unset($data[$size][$pos]);
			// replace with md5 of the file
			$data[$size][md5_file($file)][] = $file;
		}
		// for the current size array, remove unique md5 entries
		foreach($data[$size] as $key2=>$arr2){
			if(count($arr2)== 1){
				unset($data[$size][$key2]);
			}
		}
		// if the current size is now empty, remove
		if(count($data[$size])==0){
			unset($data[$size]);
		}
	}
}

echo '<pre>',print_r($data,true),'</pre>';

 


Thanks requinix and mac_gyver,

Sounds like going with MD5 and then some additional checks is the route to go.

I would like to provide a little more context and explore a related topic.

The application is for document management and is multitenant, where each tenant will only have access to their own documents.  That being said, if two files owned by two tenants are the same string of bits, only the user_provided_filename and uploaded_at belong to the tenant, and the physical file belongs to both of them.  My schema will look something like the following, and the physical file will not be exposed as a resource but will instead be served using readfile() or maybe X-Sendfile, etc.

document

  • id (PK)
  • tenant_id (FK)
  • user_provided_filename (string)
  • uploaded_at (datetime)
  • physical_file_id (datatype TBD and FK to physical_file.id)
  • file_storage_location, file_size, media_type (NOT INCLUDED)

physical_file

  • id (TBD)
  • hash (TBD whether needed and if so what the unique strategy should be)
  • size (int)
  • media_type_id (int, FK to media_type.id, or maybe use media_type as a natural key but probably not)
  • file_storage_location (string.  TBD whether required)
  • user_provided_filename, uploaded_at, tenant_id (NOT INCLUDED)

Originally, I was contemplating using the file's hash as physical_file's primary key, and maybe getting rid of file_storage_location by using the first four bytes of the hash as the root directory, the next four as a subdirectory, and then saving the file with the hash as its name and no extension.

But this won't work based on both of your comments as I might have duplicate hashes.

So now I am considering making physical_file's primary key an auto-increment integer and including the hash column with an index, but not a unique index (not even combined with size).  When a new document is uploaded, I will get its MD5 hash and query the DB for existing records with the same hash, size, and media type (agree with media type?).  If one or more files exist, I will compare the content against each until I get a match and return the ID to be stored in the document table.  If I don't have a match, I will store the file on disk, named maybe with its PK plus a ".tbd" extension (I know using the PK outside its intended use in the DB is often frowned upon, but it seems reasonable).  For a directory structure, maybe store files with IDs from 1 to 10000 in directory "dir-1-to-10000", etc. (or 1 to 1000?).
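A quick sketch of that ID-based layout (the 10000-per-directory bucket size, the directory naming, and the ".tbd" extension are just the placeholders mentioned above; $newId stands for the new row's primary key):

<?php
// Sketch: map an auto-increment physical_file ID to a storage path,
// bucketing 10000 files per directory (e.g. ID 12345 -> dir-10001-to-20000/12345.tbd).
function storagePathForId(int $id, string $baseDir): string
{
    $bucket = intdiv($id - 1, 10000);   // 0 for IDs 1-10000, 1 for 10001-20000, ...
    $dir = sprintf('%s/dir-%d-to-%d', $baseDir, $bucket * 10000 + 1, ($bucket + 1) * 10000);

    if (!is_dir($dir)) {
        mkdir($dir, 0775, true);        // create the bucket directory on first use
    }
    return $dir.'/'.$id.'.tbd';
}

// e.g. move_uploaded_file($_FILES['document']['tmp_name'], storagePathForId($newId, __DIR__.'/storage'));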

Seem reasonable?  Thanks


2 hours ago, NotionCommotion said:

The application is for document management and is multitenant, where each tenant will only have access to their own documents.  That being said, if two files owned by two tenants are the same string of bits, only the user_provided_filename and uploaded_at belong to the tenant, and the physical file belongs to both of them.

Why do you care about having them both share the same bytes? Disk space is cheap, and this scheme is making stuff complicated.


3 hours ago, requinix said:

Why do you care about having them both share the same bytes? Disk space is cheap, and this scheme is making stuff complicated.

I agree, and the primary reason is to improve the user experience around identical files.  Various users will upload various documents; these documents will be associated with multiple scopes, the scopes with multiple projects, and the projects with multiple assets.  The majority of the documents will be PDFs, often downloaded from the same source, so they will be identical but will likely have different filenames.  When the project is over, other users will need to access the documents and will search based on things they understand (i.e. the project, scope, and/or asset), not by the filename, which they wouldn't know.  When returning the search results, I don't want them to have to look through a bunch of identical documents, only the unique ones.  The only way I could think of doing this is to decouple the physical file from the document.  Of course, this approach will not help with scanned documents, but hopefully it will be good enough.  Can you think of a better way?  Thanks

On 6/8/2021 at 9:36 AM, NotionCommotion said:

My primary reason for doing so is not really to save hard drive space but to be able to manage duplicate documents.

 


Do the users know that these files are unique? Because I would expect that most of the time, when someone uploads a file to a place, they'll expect it to be available at that place. Personally, I think it would be weird to search for a file that I know was uploaded 10 different times and only see one search result.


30 minutes ago, requinix said:

Do the users know that these files are unique? Because I would expect that most of the time, when someone uploads a file to a place, they'll expect it to be available at that place. Personally, I think it would be weird to search for a file that I know was uploaded 10 different times and only see one search result.

There are two types of users: vendor users and end users.  Vendor users will be uploading files during project execution, and the path where a file was posted will remain the path that retrieves it.  End users will later inherit the documents and will only care about quickly getting the documents relevant to what they care about.  They may still care about whether a file was uploaded 10 different times and which projects it was uploaded under, but they will not wish to spend time determining whether they are looking at the same or different files; instead they will be presented with just the unique files along with the various places each was uploaded.


Then how about changing your approach: allow multiple uploads of the same file that will create duplicates, but give people an easy way to (1) store documents under their account or something similar, then (2) "share" those documents with whatever projects.

Vendors upload documents, like certificates and proofs and whatever, and then add those documents to their project. It creates a clearer understanding that there's effectively only the one document which is being shared with the end users. And, frankly, I think it models the real-world behavior better than worrying about detecting identical files.


Thanks requinix.  Maybe, but I'm not sure.  I will need to discuss it with others.

I think I am going to use what I learned here for something else.  I have so many dang duplicated photos that it stresses me to even attempt to organize them.  Should be fairly easy to sort by size and then compare.  But then I will want to read the metadata and such.  Maybe another day...

Appreciate the help.

