Working With 500,000 Files & Hashing?


Welters

Hi there, really not sure if this can be implemented any other way than by a separate program, but interested to know... bear in mind the page would be stored on the same machine as the files, so it's not like they'd all have to be uploaded...

 

Basically, I'm trying to create a system that removes duplicate files, using MD5 hashes (4) to discover them (3), before converting the originals that are kept to XML (1) and indexing them in a Solr application (2).

 

Both (1) and (2) would be done with commands issued at the command line (can these be called from PHP?).

 

(3) would require access to directory information, to store the original and duplicate locations in a MySQL database.

 

(4) would require access to every line within each file, to identify the start of certain lines and to remove line breaks, spaces etc. before hashing them.
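Steps (3) and (4) could be sketched in PHP roughly as follows. This is only a minimal sketch under assumptions: the function names normalizedHash() and findDuplicates() are made up for illustration, "normalising" is assumed to mean stripping all whitespace, and the rule for picking out "certain lines" is not specified in the post, so it is left out. Results are kept in an array here rather than written to MySQL.

```php
<?php
// Hypothetical helper: remove all whitespace (spaces, tabs, line breaks)
// so that files differing only in layout produce the same MD5 hash.
function normalizedHash(string $contents): string
{
    $normalized = preg_replace('/\s+/', '', $contents);
    return md5($normalized);
}

// Hypothetical helper: given [path => contents], return
// [duplicate path => path of the first file seen with the same hash].
function findDuplicates(array $contentsByPath): array
{
    $originals = [];   // hash => first path seen with that hash
    $duplicates = [];  // duplicate path => original path
    foreach ($contentsByPath as $path => $contents) {
        $hash = normalizedHash($contents);
        if (isset($originals[$hash])) {
            $duplicates[$path] = $originals[$hash];
        } else {
            $originals[$hash] = $path;
        }
    }
    return $duplicates;
}
```

In this sketch the first file seen with a given hash is treated as the original; any other rule (oldest file, shortest path) would need its own comparison.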

 

 

Now that I've written all this, I'm thinking it might be less likely. I just don't know if anything like this is actually possible in a server-side language like PHP. That's all I really want to know.

 

 

Cheers.

 

 


It won't be slower in PHP than in any other language... if it's a slow task then it's a slow task :)

 

Yes, command line programs can be called from PHP (this may not be true if you use a hosting provider with restrictive security policies). system() and exec() are two of the functions you can use.
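As a rough illustration of calling a command from PHP with exec(), here is a minimal sketch. The wrapper name runCommand() is hypothetical, and `echo` is only a stand-in for the real conversion and indexing commands, which the thread does not name. It assumes a Unix-like shell and that exec() has not been disabled by the host.

```php
<?php
// Hypothetical wrapper around exec(): runs a program with escaped arguments
// and returns its output lines, throwing if the exit status is non-zero.
function runCommand(string $program, array $args = []): array
{
    $command = $program;
    foreach ($args as $arg) {
        // escapeshellarg() quotes each argument so spaces or shell
        // metacharacters in file names cannot break the command line.
        $command .= ' ' . escapeshellarg($arg);
    }

    $output = [];
    $exitCode = 0;
    // "2>&1" folds error output into $output so failures are visible.
    exec($command . ' 2>&1', $output, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException(
            "Command failed ($exitCode): " . implode("\n", $output)
        );
    }
    return $output;
}

// Stand-in usage; a real call would name the converter or indexer instead.
$lines = runCommand('echo', ['hello world']);
```

Checking the exit status matters when looping over many files, since otherwise a failed conversion would pass unnoticed.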


1) Quick md5/crc32 functions are out there already (PHP has md5(), md5_file() and crc32() built in).

2) diff can spot differences in text files, and there are binary diff tools out there as well.

 

PHP could be used to recurse the subdirectories and pass the info to an md5/crc32 generator, and so forth. So it's not impossible, but it may take a while.
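The recursion and hashing described here might look something like the sketch below. It is an illustration only: hashTree() is a made-up name, it hashes the raw file bytes with md5_file() without any normalisation, and it collects everything in memory, which may not be practical for 500,000 files without writing to a database as it goes.

```php
<?php
// Hypothetical helper: walk $root recursively and return
// [md5 hash => list of paths] for hashes shared by more than one file.
function hashTree(string $root): array
{
    $byHash = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue; // skip anything that is not a regular file
        }
        $hash = md5_file($file->getPathname());
        if ($hash === false) {
            continue; // unreadable file; simply skipped in this sketch
        }
        $byHash[$hash][] = $file->getPathname();
    }

    // Keep only the hashes that occur for two or more files.
    return array_filter($byHash, function ($paths) {
        return count($paths) > 1;
    });
}
```

Each entry in the result is a group of files with identical contents; deciding which one to keep as the original is left to the caller.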

 

But PHP is just an interpreted language, so a compiled language (C/C++/C#) will definitely give better performance.

 

Good luck.

