
Remove duplicates from CSV


wiggst3r


Found this little script online. You may want to process the CSV in chunks rather than all at once if it's running slow.

 

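<?php
$filename = "file.csv";
$file = fopen($filename, "r");
$read = fread($file, filesize($filename));
fclose($file);

// Normalise Windows line endings so "a,b\r" and "a,b" count as the same row.
$split = array_unique(explode("\n", str_replace("\r\n", "\n", $read)));

$filename2 = "other.csv";

// "w" starts from an empty file; "a" would append and re-add rows on every run.
$file2 = fopen($filename2, "w");

foreach($split as $value) {
    if($value != "") {
        fwrite($file2, $value . "\n");
    }
}

fclose($file2);

echo "Update done successfully.";
?>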
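
The chunking idea can be taken down to one line at a time. Below is a minimal sketch, assuming that duplicates are whole identical lines and that empty lines should be dropped, as in the script above. `dedupe_lines` is a hypothetical helper name, not something from the script. It never holds the whole file in memory, only one fixed-size hash per distinct row.

```php
<?php

// Hypothetical helper: copy $in to $out, keeping only the first copy of each line.
function dedupe_lines ( $in, $out )
{
	$fi = fopen ( $in, 'rb' );
	$fo = fopen ( $out, 'wb' );
	$seen = array ();
	$kept = 0;

	while ( ( $line = fgets ( $fi ) ) !== false )
	{
		// compare rows without their line break, and skip empty rows
		$row = rtrim ( $line, "\r\n" );

		if ( $row === '' )
		{
			continue;
		}

		// store a fixed-size hash as the array key instead of the full row
		$key = md5 ( $row );

		if ( ! isset ( $seen[$key] ) )
		{
			$seen[$key] = true;
			fwrite ( $fo, $row . "\n" );
			$kept++;
		}
	}

	fclose ( $fi );
	fclose ( $fo );

	// number of rows written to $out
	return $kept;
}
```

It could be called as `dedupe_lines ( 'file.csv', 'other.csv' );`. Memory use grows with the number of distinct rows rather than with the size of the file.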


If you use PHP, use a line-by-line reader / writer with a temp directory to store your hash references as files. That way you use very little memory, and the comparison becomes a cached file lookup rather than a string or array comparison in your loop. It should speed up the process while using the least amount of memory! I'll give you an example if you want one...

 

 


My point was "using the least amount of memory", and reading and writing line by line uses the least. Then, if you need to compare one or more "FIELDS" in each row, you don't need to store those comparisons in an array, because you use the hash of those fields as a file name.

 

example... (to make a test CSV file)

 


<?php

$total = 0;

$lines = 750000;

$file = './in.csv';

$array = array ( 7, 9, 0, 6, 4, 44, 77, 22 );

$io = fopen ( $file, 'wb' );

while ( ++$total <= $lines )
{
shuffle ( $array );

fputs ( $io, implode ( "', '", $array ) . "\r\n" );
}

fclose ( $io );

?>

 

Then to eliminate the duplicates that exist in "FIELDS" => 0, 2 & 4...

 

// note, the $temp folder must exist before running the script
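
If the folder might not be there yet, a small guard could run first. This is only a sketch, and `ensure_temp_dir` is a hypothetical helper name. The folder should also be empty before a run, because any leftover marker file would make the matching row look like a duplicate.

```php
<?php

// Hypothetical helper: make sure the temp folder exists and is writable.
function ensure_temp_dir ( $temp )
{
	// the second is_dir() check covers a folder created by another process in between
	if ( ! is_dir ( $temp ) && ! mkdir ( $temp, 0777, true ) && ! is_dir ( $temp ) )
	{
		return false;
	}

	return is_writable ( $temp );
}
```

A script could then stop early with something like `if ( ! ensure_temp_dir ( './files' ) ) { exit ( 'temp folder missing' ); }`.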

 


<?php

$s = microtime ( true );

// file to read

$in = './in.csv';

// file to write

$out = './out.csv';

// temp directory...

$temp = './files';

// csv values split by...

$split = "', '";

// columns to compare (a row is a duplicate when all of them match an earlier row)

$columns = array ( 0, 2, 4 );

$fi = fopen ( $in, 'rb' );

$fo = fopen ( $out, 'wb' );

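// fgets returns false at end of file, which avoids processing a phantom empty row
while ( ( $l = fgets ( $fi ) ) !== false )
{
// split the row without its trailing line break
$a = explode ( $split, rtrim ( $l, "\r\n" ) );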

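$h = array ();

foreach ( $columns AS $e )
{
	// a missing column counts as an empty value
	$h[] = isset ( $a[$e] ) ? $a[$e] : '';
}

// join with a separator so ( 'ab', 'c' ) and ( 'a', 'bc' ) give different keys;
// md5 instead of crc32, because a 32 bit checksum can collide for unrelated rows
$h = '_' . md5 ( implode ( "\0", $h ) );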

if ( ! file_exists ( $temp . '/' . $h ) )
{
	file_put_contents ( $temp . '/' . $h, '' );

	fwrite ( $fo, $l );
}
}

fclose ( $fi );

fclose ( $fo );

if ( $fd = opendir ( $temp ) )
{
while ( false !== ( $file = readdir ( $fd ) ) )
{
	if ( $file != '.' && $file != '..' )
	{
		unlink ( $temp . '/' . $file );
	}
}

closedir ( $fd );
}

echo microtime ( true ) - $s;

?>
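
For comparison, the same column-based check can keep its keys in a PHP array instead of marker files. This is a sketch and not the approach described above: memory then grows with the number of distinct key combinations, which is the trade-off the temp folder avoids. `dedupe_by_columns` is a hypothetical name, and the md5 key with a separator between fields is a choice made for this sketch. The test file generated earlier can contain at most 8 x 7 x 6 = 336 distinct combinations of fields 0, 2 and 4, so for that data the array stays tiny.

```php
<?php

// Hypothetical helper: keep the first row for each combination of $columns.
function dedupe_by_columns ( $in, $out, $split, array $columns )
{
	$fi = fopen ( $in, 'rb' );
	$fo = fopen ( $out, 'wb' );
	$seen = array ();

	while ( ( $l = fgets ( $fi ) ) !== false )
	{
		$a = explode ( $split, rtrim ( $l, "\r\n" ) );
		$k = array ();

		foreach ( $columns as $e )
		{
			// a missing column counts as an empty value
			$k[] = isset ( $a[$e] ) ? $a[$e] : '';
		}

		// the array key plays the role of the marker file name
		$h = md5 ( implode ( "\0", $k ) );

		if ( ! isset ( $seen[$h] ) )
		{
			$seen[$h] = true;
			fwrite ( $fo, $l );
		}
	}

	fclose ( $fi );
	fclose ( $fo );

	// number of distinct key combinations, which equals the rows written
	return count ( $seen );
}
```

With the settings from the script above it could be called as `dedupe_by_columns ( './in.csv', './out.csv', "', '", array ( 0, 2, 4 ) );`.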

