
Remove duplicates from CSV


wiggst3r


Found this little script online. It loads the whole file into memory, so if it runs slowly on a large CSV you may want to read the file in chunks (or line by line) instead.

 

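<?php
$filename = "file.csv";

// read the whole input file into memory
$file = fopen($filename, "r");
$read = fread($file, filesize($filename));

// split into lines and keep only the first copy of each one
$split = array_unique(explode("\n", $read));

fclose($file);

$filename2 = "other.csv";

// "a" appends to other.csv if it already exists; use "w" to overwrite it
$file2 = fopen($filename2, "a");

foreach($split as $key=>$value) {
    // skip empty lines
    if($value != "") {
        fwrite($file2, $value . "\n");
    }
}

fclose($file2);

echo "Update done successfully.";
?>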
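If the whole-file read above is too heavy, a line-by-line variant is one way to read it "in chunks". The following is only a sketch: the function name `dedupe_lines` is made up for illustration, and it assumes the set of distinct line hashes still fits in memory (only the rows themselves are streamed).

```php
<?php
// Hypothetical helper (not from the original script): copies $in to $out,
// keeping the first copy of each non-empty line. Returns the number of lines kept.
function dedupe_lines($in, $out)
{
    $fi = fopen($in, 'rb');
    $fo = fopen($out, 'wb');

    $seen = array(); // md5 of a line => true
    $kept = 0;

    // fgets() returns false at end of file, so only one line is held at a time
    while (($line = fgets($fi)) !== false) {
        // strip the line ending so "a\r\n" and "a\n" count as the same line
        $line = rtrim($line, "\r\n");

        if ($line === '') {
            continue;
        }

        $key = md5($line);

        if (!isset($seen[$key])) {
            $seen[$key] = true;
            fwrite($fo, $line . "\n");
            $kept++;
        }
    }

    fclose($fi);
    fclose($fo);

    return $kept;
}
```

Calling `dedupe_lines('file.csv', 'other.csv')` would mirror the script above, except that it overwrites other.csv rather than appending to it.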

If you use PHP, use a line-by-line reader/writer with a temp directory that stores your hash references as files. That way you use very little memory, and each comparison is a file-existence check that the filesystem caches, so the lookup stays quick without storing arrays and doing string or array comparisons in your loop. It keeps memory use to a minimum. I'll give you an example if you want one...

 

 

My point was about using the least amount of memory: reading and writing line by line keeps only one row in memory at a time. Then, if you need to compare one or more "FIELDS" in each row, you don't need to store those comparisons in an array, because you create a file named after the hash of those fields and simply check whether it already exists.
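The marker-file idea can be sketched in isolation like this. Both function names are hypothetical, and md5 is my assumption for the hash; any hash with few collisions would do. The directory passed in must already exist and be writable.

```php
<?php
// Hypothetical helper: joins the chosen columns of one row into a comparison key.
// The "\0" separator keeps ("1", "23") distinct from ("12", "3").
function row_key(array $fields, array $columns)
{
    $parts = array();

    foreach ($columns as $i) {
        // rows that are too short contribute an empty value
        $parts[] = isset($fields[$i]) ? $fields[$i] : '';
    }

    return implode("\0", $parts);
}

// Hypothetical helper: returns true the first time $key is seen and false on
// every later call, using an empty file in $dir as the "seen" marker.
function first_time_seen($dir, $key)
{
    $marker = $dir . '/_' . md5($key);

    if (file_exists($marker)) {
        return false;
    }

    // create the empty marker file
    touch($marker);

    return true;
}
```

Nothing about earlier rows is kept in a PHP array here; the directory itself acts as the lookup table.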

 

example... (to make a test CSV file)

 


<?php

$total = 0;

$lines = 750000;

$file = './in.csv';

$array = array ( 7, 9, 0, 6, 4, 44, 77, 22 );

$io = fopen ( $file, 'wb' );

while ( ++$total <= $lines )
{
shuffle ( $array );

fputs ( $io, implode ( "', '", $array ) . "\r\n" );
}

fclose ( $io );

?>

 

Then to eliminate the duplicates that exist in "FIELDS" => 0, 2 & 4...

 

// note, the $temp folder must exist before running the script

 


<?php

$s = microtime ( true );

// file to read

$in = './in.csv';

// file to write

$out = './out.csv';

// temp directory...

$temp = './files';

// csv values split by...

$split = "', '";

// columns to check for duplicates (a row is dropped if all of them match an earlier row)

$columns = array ( 0, 2, 4 );

$fi = fopen ( $in, 'rb' );

$fo = fopen ( $out, 'wb' );

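// fgets returns false at end of file, so no empty "row" is processed after the last line
while ( ( $l = fgets ( $fi ) ) !== false )
{
// split the row without its line ending so the last column compares cleanly
$a = explode ( $split, rtrim ( $l, "\r\n" ) );

$h = '';

foreach ( $columns AS $e )
{
	// the separator keeps "1" + "23" distinct from "12" + "3"; missing columns count as empty
	$h .= ( isset ( $a[$e] ) ? $a[$e] : '' ) . "\0";
}

// md5 is far less likely than crc32 to give two different keys the same file name
$h = '_' . md5 ( $h );

if ( ! file_exists ( $temp . '/' . $h ) )
{
	// first time this key is seen: leave an empty marker file and keep the row
	file_put_contents ( $temp . '/' . $h, '' );

	fwrite ( $fo, $l );
}
}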

fclose ( $fi );

fclose ( $fo );

if ( $fd = opendir ( $temp ) )
{
while ( false !== ( $file = readdir ( $fd ) ) )
{
	if ( $file != '.' && $file != '..' )
	{
		unlink ( $temp . '/' . $file );
	}
}

closedir ( $fd );
}

echo microtime ( true ) - $s;

?>
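To check the result, something like the following could count how many rows in a file still repeat an earlier row's key columns. It is a sketch only: `count_duplicate_keys` is a made-up name, and unlike the script above it keeps the keys in a PHP array, so it assumes the number of distinct keys is small enough to fit in memory.

```php
<?php
// Hypothetical checker: returns how many rows in $file repeat the key columns
// of an earlier row. $split and $columns have the same meaning as in the script above.
function count_duplicate_keys($file, $split, array $columns)
{
    $seen = array();
    $dupes = 0;

    $fi = fopen($file, 'rb');

    while (($l = fgets($fi)) !== false) {
        $l = rtrim($l, "\r\n");

        // ignore blank lines
        if ($l === '') {
            continue;
        }

        $a = explode($split, $l);
        $parts = array();

        foreach ($columns as $e) {
            $parts[] = isset($a[$e]) ? $a[$e] : '';
        }

        $key = implode("\0", $parts);

        if (isset($seen[$key])) {
            $dupes++;
        } else {
            $seen[$key] = true;
        }
    }

    fclose($fi);

    return $dupes;
}
```

For example, `count_duplicate_keys('./out.csv', "', '", array(0, 2, 4))` should return 0 after the script above has run, while the same call on ./in.csv shows how many rows were candidates for removal.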
