Welters Posted December 30, 2008

Hi there,

I'm really not sure whether this can be implemented any other way than with a separate program, but I'm interested to know. Bear in mind that the page would be stored locally alongside the files, so it's not as if they'd all have to be uploaded.

Basically, I'm trying to create a system that removes duplicate files by using MD5 hashes (4) to discover them (3), then converts the originals that are kept to XML (1) and indexes them in a Solr application (2).

(1) and (2) would both be done with commands issued at the command line (can these be called from PHP?).
(3) would require access to directory information, to store the original and duplicate locations in a MySQL database.
(4) would require access to every line within each file, to identify the start of certain lines and remove line breaks, spaces, etc. before hashing.

Now that I've written all this, I'm thinking it might be less likely. I just don't know whether anything like this is actually possible in a server-side language like PHP. That's all I really want to know.

Cheers.
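For step (4), a minimal sketch of hashing a file after stripping whitespace might look like the following. The function name normalized_md5 is hypothetical, and the sketch only removes spaces, tabs, and line breaks; the rule for "the start of certain lines" is not specified in the post, so it is left out.

```php
<?php
// Hypothetical helper: hash a file's text with all whitespace removed,
// so copies that differ only in spacing or line breaks share one MD5.
function normalized_md5(string $path): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $ctx = hash_init('md5');
    // Read line by line so a large file never has to fit in memory.
    while (($line = fgets($handle)) !== false) {
        // Drop spaces, tabs, and line breaks before feeding the hash.
        hash_update($ctx, str_replace([" ", "\t", "\r", "\n"], '', $line));
    }
    fclose($handle);
    return hash_final($ctx);
}
```

Two files whose text differs only in whitespace would then produce the same hash, which is what makes them candidates for the duplicate check.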
DarkWater Posted December 30, 2008

Yeah, it's doable. It's just going to be very slow. >_>
btherl Posted December 30, 2008

It won't be slower in PHP than in any other language. If it's a slow task, then it's a slow task.

Yes, command-line programs can be called from PHP (this may not be true if you use a hosting provider with restrictive security policies). system() and exec() are two of the interfaces.
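As a rough illustration of the exec() route, here is a small wrapper. The name run_command is hypothetical, and the sketch assumes exec() has not been disabled on the host and that the shell is Unix-like.

```php
<?php
// Hypothetical wrapper: run an external program and return its output lines.
function run_command(string $program, array $args): array
{
    // escapeshellarg() quotes each argument so unusual file names
    // cannot be interpreted as shell syntax.
    $command = escapeshellcmd($program) . ' '
        . implode(' ', array_map('escapeshellarg', $args));

    // exec() fills $output with one entry per line and sets $exitCode.
    exec($command, $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code $exitCode: $command");
    }
    return $output;
}
```

Calling run_command('echo', ['some text']) returns the lines the command printed. The XML conversion and Solr indexing commands from the original post would be passed the same way, with their own program names and arguments.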
laffin Posted December 30, 2008

1) Quick MD5/CRC32 functions are out there already.
2) diff can spot differences in text files, and there is a binary diff app out there as well.

PHP could be used to recurse the subdirectories and pass the info to the MD5/CRC32 generator, and so forth.

So it's not impossible, but it may take a while. PHP is an interpreted language, though, so a compiled language (C/C++/C#) will definitely give better performance.

Good luck.
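The recursion described above can be sketched with PHP's built-in directory iterators and md5_file(). The function name find_duplicates is hypothetical, and this version hashes raw file contents with no normalisation and no database step.

```php
<?php
// Hypothetical helper: walk $root recursively and group paths by content MD5.
// Returns only the groups holding more than one path, i.e. the duplicates.
function find_duplicates(string $root): array
{
    $byHash = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $hash = md5_file($file->getPathname());
        if ($hash === false) {
            continue; // unreadable file, skip it
        }
        $byHash[$hash][] = $file->getPathname();
    }
    // Keep only hashes that were seen for two or more files.
    return array_filter($byHash, function (array $paths) {
        return count($paths) > 1;
    });
}
```

Each returned group maps one hash to every path that shares it, so one path per group could be kept as the original and the rest recorded as duplicates. With 500,000 files, a cheaper first pass such as grouping by file size before hashing may be worth trying, though that is a suggestion rather than something measured here.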