wiggst3r Posted February 4, 2009
Hi, I have a CSV file with over 500,000 rows. What's the best way to remove any duplicate rows? Thanks
https://forums.phpfreaks.com/topic/143786-remove-duplicates-form-csv/
gevans Posted February 4, 2009
Whatever you do is going to be slow with 500,000+ lines, but if you read the CSV file line by line into an array and use array_unique() it will filter out all duplicate rows.
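Something along these lines (untested sketch; the filenames and sample data are made up, and it assumes the whole file fits in memory):

```php
<?php
// make a small sample CSV with duplicate rows (stand-in for the real file)
file_put_contents('input.csv', "a,1\nb,2\na,1\nc,3\nb,2\n");

// read every line into an array -- this is the memory-hungry part for 500k rows
$lines = file('input.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// array_unique() keeps the first occurrence of each row and drops the rest
$unique = array_unique($lines);

// write the de-duplicated rows back out
file_put_contents('output.csv', implode("\n", $unique) . "\n");
?>
```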
wiggst3r Posted February 4, 2009 Author
Isn't that going to be quite memory hungry, putting such a huge amount of rows into an array?
Maq Posted February 4, 2009
Found this little script online. You may want to use some kind of recursive method to take chunks of the CSV if it's running slow.

<?php
$filename = "file.csv";
$file = fopen($filename, "r");
$read = fread($file, filesize($filename));
fclose($file);

$split = array_unique(explode("\n", $read));

$filename2 = "other.csv";
$file2 = fopen($filename2, "a");
foreach ($split as $key => $value) {
    if ($value != "") {
        fwrite($file2, $value . "\n");
    }
}
fclose($file2);

echo "Update done successfully.";
?>
gevans Posted February 4, 2009
I was actually curious about this, so I've just been running arrays of up to 200,000 lines locally. It struggles a bit, but in a live environment you should have more memory dedicated to the process.
wiggst3r Posted February 4, 2009 Author
Well, what I'm ideally looking to do is remove all duplicate email addresses, and the whole row in which each duplicate email address occurs, from a CSV with 4 fields. The file is approx 50MB and the CSV has 900,000 rows.
gevans Posted February 4, 2009
And you want to use PHP to do this?
wiggst3r Posted February 4, 2009 Author
What else could I use? I can't open it in OpenOffice/Excel because there are too many rows.
gevans Posted February 4, 2009
This might do the trick for you: CLICKY
wiggst3r Posted February 4, 2009 Author
Thanks. I'll take a look at home, as I'm using a Mac at the moment.
Mark Baker Posted February 4, 2009
In a similar situation, I insert the data into a temporary database table with a unique constraint on the e-mail address column. You could do something similar, and then write the database table data back to the CSV.
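A rough sketch of that idea, using SQLite via PDO so it's self-contained (the table name, column names, and sample rows are made up; with MySQL the equivalent of SQLite's INSERT OR IGNORE would be INSERT IGNORE):

```php
<?php
// temporary in-memory table; the UNIQUE constraint does the de-duplication
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE emails (
    email TEXT UNIQUE,
    f1 TEXT, f2 TEXT, f3 TEXT
)');

// made-up sample rows standing in for the 900,000-row CSV
$rows = array(
    array('a@b.com', 'x', 'y', 'z'),
    array('c@d.com', 'x', 'y', 'z'),
    array('a@b.com', 'p', 'q', 'r'), // duplicate email: whole row is dropped
);

// INSERT OR IGNORE silently skips rows that violate the unique constraint
$stmt = $db->prepare('INSERT OR IGNORE INTO emails VALUES (?, ?, ?, ?)');
foreach ($rows as $row) {
    $stmt->execute($row);
}

// write the de-duplicated table back out as CSV
$out = fopen('deduped.csv', 'w');
foreach ($db->query('SELECT * FROM emails') as $row) {
    fputcsv($out, array($row['email'], $row['f1'], $row['f2'], $row['f3']));
}
fclose($out);
?>
```

For the real file you'd replace the $rows array with an fgetcsv() loop over the 50MB input, so only one row is in memory at a time.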
printf Posted February 4, 2009
If you use PHP, use a reader/writer with a temp directory to store your file hash references. That way you use very little memory, and your compare is based on an IO handle which is cached, so the lookup is faster than storing arrays and doing string or array based comparisons in your loop. It will speed up the process and use the least amount of memory! I'll give you an example if you want one...
gevans Posted February 4, 2009
I don't know if wiggst3r wants an example, but I would!
printf Posted February 4, 2009
My point was "using the least amount of memory", so reading and writing line by line will use the least amount of memory. Then, if you need to compare one or more FIELDS in each row, you don't need to store those comparisons in an array, because you assign a file name to that hash comparison. Example... (to make a test CSV file)

<?php
$total = 0;
$lines = 750000;
$file  = './in.csv';
$array = array(7, 9, 0, 6, 4, 44, 77, 22);

$io = fopen($file, 'wb');
while (++$total <= $lines) {
    shuffle($array);
    fputs($io, implode("', '", $array) . "\r\n");
}
fclose($io);
?>

Then, to eliminate the rows that duplicate FIELDS 0, 2 & 4:

// note: the $temp folder must exist before running the script
<?php
$s = microtime(true);

// file to read
$in = './in.csv';
// file to write
$out = './out.csv';
// temp directory...
$temp = './files';
// csv values split by...
$split = "', '";
// columns to remove duplicates from (if they all match another row)
$columns = array(0, 2, 4);

$fi = fopen($in, 'rb');
$fo = fopen($out, 'wb');

while (!feof($fi)) {
    $l = fgets($fi, 4096);
    $a = explode($split, $l);
    $h = '';
    foreach ($columns AS $e) {
        $h .= $a[$e];
    }
    // use the hash of the selected columns as a file name in the temp dir
    $h = '_' . crc32($h);
    if (!file_exists($temp . '/' . $h)) {
        file_put_contents($temp . '/' . $h, '');
        fwrite($fo, $l);
    }
}

fclose($fi);
fclose($fo);

// clean up the temp hash files
if ($fd = opendir($temp)) {
    while (false !== ($file = readdir($fd))) {
        if ($file != '.' && $file != '..') {
            unlink($temp . '/' . $file);
        }
    }
    closedir($fd);
}

echo microtime(true) - $s;
?>