PatrickPetersen Posted May 28, 2013

Hello - first of all, I have low-to-medium knowledge of PHP, but I haven't used it for almost 10 years, and I'm completely stuck here. I hope someone can help me before I lose my mind.

I have a CSV file with a structure like this:

street, status, project
Road Street 2, Possible, HyperGalaxy
Street Road 4, Ready, Galaxy
Road Street 2, Unknown, Unknown

I've tried importing it into Excel and writing a macro, but after 3-5 hours of processing my computer crashes and I have to start all over.

My goal is that "Road Street 2, Unknown, Unknown" is deleted because of its status - "Possible" is "better" than "Unknown" in this case. I have around 300,000 rows and I know there will be approx. 10,000 duplicated values (and therefore 5,000 unique ones).

Is there any way of deleting duplicates by a "condition"?
Barand Posted May 28, 2013

Backup your data first.

DELETE profile
FROM profile
     JOIN (
         SELECT street
         FROM profile
         GROUP BY street
         HAVING COUNT(*) > 1
     ) as dupes USING (street)
WHERE status = 'Unknown'
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's SQL above]

Hi - maybe I'm just getting stupid as time goes by, but isn't that a query against an SQL database? I'm working with a CSV file (I hadn't dealt with CSV "databases" before starting this survey).
mac_gyver Posted May 29, 2013

Here are some coding hints for doing this with PHP:

1) Use fgetcsv() for reading/parsing each line from the CSV file.

2) Since you need a count of each street address, read through all the data once, storing the street addresses in an array, then use array_count_values() to get the count of each street address.

3) Read through the data a second time to process it. For each street address, the array obtained in the step above tells you how many times that address exists in the data. You can combine that count with the status value to produce the logic needed to ignore/remove rows.
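[Editor's note: the three steps above can be sketched as follows. This is an illustration only, not tested against the real data - the filenames and the assumption that street is column 0 and status is column 1 come from the sample in the first post.]

```php
<?php
// Build a tiny sample file using the data from the first post.
file_put_contents('input.csv', implode("\n", array(
    'street, status, project',
    'Road Street 2, Possible, HyperGalaxy',
    'Street Road 4, Ready, Galaxy',
    'Road Street 2, Unknown, Unknown',
)));

// Pass 1: collect every street address and count occurrences.
$fp = fopen('input.csv', 'r');
$head = fgetcsv($fp, 1024);              // header line
$streets = array();
while ($line = fgetcsv($fp, 1024)) {
    $streets[] = trim($line[0]);
}
$counts = array_count_values($streets);

// Pass 2: rewind and copy every row, except duplicates whose status is 'Unknown'.
rewind($fp);
fgetcsv($fp, 1024);                      // skip the header again
$out = fopen('output.csv', 'w');
fputcsv($out, $head);
while ($line = fgetcsv($fp, 1024)) {
    if ($counts[trim($line[0])] > 1 && trim($line[1]) == 'Unknown') {
        continue;                        // drop the weaker duplicate
    }
    fputcsv($out, $line);
}
fclose($fp);
fclose($out);
```

Running this against the sample leaves the header plus the "Possible" and "Ready" rows in output.csv; the duplicate "Unknown" row is dropped.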
Barand Posted May 29, 2013

Hope this makes amends (at least you've seen the benefit of a database).

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024);      // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line;
}
fclose($fp);

//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street => $streetdata) {
    $dupe = count($streetdata) > 1;
    foreach ($streetdata as $line) {
        if ($dupe && trim($line[1]) == 'Unknown') {
            continue;
        } else {
            fputcsv($fp, $line);
        }
    }
}
fclose($fp);
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's script above]

This works! Now to a more difficult part: I have made a "ranking" of the statuses (which status weighs more than the others). How can it be implemented? My ranking is:

Ready
Possible
BBU - Possible
Pipe ready
BBU - Pipe ready
Established
Waiting for planning
Study
Not possible/Project delayed
Not possible
Not relevant
Unknown
BBU - Unknown

It's for my bachelor thesis, so I really do appreciate your help!
Solution: Barand Posted May 29, 2013

Needs a little adjustment to sort the data by rank for each street.

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024);      // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line;
}
fclose($fp);

//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street => $streetdata) {
    // use a custom sort to sort by ranking
    usort($streetdata, 'sortrank');
    // we only want the one with the highest ranking
    $line = array_shift($streetdata);
    fputcsv($fp, $line);
}
fclose($fp);

function sortrank($a, $b)
{
    $rank = array(
        'Ready'                        => 1,
        'Possible'                     => 2,
        'BBU - Possible'               => 3,
        'Pipe ready'                   => 4,
        'BBU - Pipe ready'             => 5,
        'Established'                  => 6,
        'Waiting for planning'         => 7,
        'Study'                        => 8,
        'Not possible/Project delayed' => 9,
        'Not possible'                 => 10,
        'Not relevant'                 => 11,
        'Unknown'                      => 12,
        'BBU - Unknown'                => 13
    );
    return $rank[trim($a[1])] - $rank[trim($b[1])];
}
PatrickPetersen Posted May 29, 2013 Author Share Posted May 29, 2013 (edited) Needs a little adjustment to sort the data by rank for each street $fp = fopen('input.txt', 'r'); $data = array(); $head = fgetcsv($fp, 1024); // header line while ($line = fgetcsv($fp, 1024)) { $data[$line[0]][] = $line; } fclose($fp); // // write to file // $fp = fopen('output.txt', 'w'); fputcsv($fp, $head); foreach ($data as $street=>$streetdata) { // use a custom sort to sort by ranking usort($streetdata, 'sortrank'); // we only want the one with the highest ranking $line = array_shift($streetdata); fputcsv($fp, $line); } fclose($fp); function sortrank($a, $b) { $rank = array ( 'Ready' => 1, 'Possible' => 2, 'BBU - Possible' => 3, 'Pipe ready' => 4, 'BBU - Pipe ready' => 5, 'Established' => 6, 'Waiting for planning' => 7, 'Study' => 8, 'Not possible/Project delayed' => 9, 'Not possible' => 10, 'Not relevant' => 11, 'Unknown' => 12, 'BBU - Unknown' => 13 ); return $rank[trim($a[1])] - $rank[trim($b[1])]; } Cool! I've tried that now and it writes.. Just give me 2 minuttes and I'll run it Edited May 29, 2013 by PatrickPetersen Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's ranking script above]

It makes a file on my server, but the data seems unchanged. You can see my input at http://thesis.mesmerize.dk/input.csv, and after you have run the script at http://thesis.mesmerize.dk/input.php you get the output at http://thesis.mesmerize.dk/output.csv
Barand Posted May 29, 2013

Did you allow for the fact that my script opens "input.txt" and writes to "output.txt"?
PatrickPetersen Posted May 29, 2013 (Author; edited May 29, 2013)

You're thinking about the file permissions, right? I've checked all the boxes just in case (chmod 777), but it still gives the same result. A case from the csv/txt file:

2300 København S Englandsvej 160;Not possible/Project delayed;
2300 København S Englandsvej 160;Unknown;

It should remove "2300 København S Englandsvej 160;Unknown;" and keep "2300 København S Englandsvej 160;Not possible/Project delayed;".
jazzman1 Posted May 29, 2013

Does that happen only with strings containing special chars like "København", or not?
PatrickPetersen Posted May 29, 2013 (Author)

It's not only in lines with special chars.
jazzman1 Posted May 29, 2013 Share Posted May 29, 2013 (edited) If you want to remove every lines containg the word "Unknown" itself, the easiest way is to create a RegEx which pattern should match every line containing that word. Open up and read the .csv file and use preg_replace replacing all matched lines with null result. Your RegEx pattern should look something like: (according that link -http://thesis.mesmerize.dk/input.csv ) ^".*;Unknown"$ EDIT: If you are using some IDE like NetBeans, you can open the csv file, go to the replace tab and enter my regex string replacing all of them with null result. Edited May 29, 2013 by jazzman1 Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013

Unfortunately, that's not enough: it needs to weigh the values. Not all duplicates have, for example, "Unknown" as a status; it needs to meet the conditions in the ranking array. If the same address from the earlier example had "Ready" in one row and "Not relevant" in another, it should delete the "Not relevant" entry because "Ready" ranks higher than "Not relevant".
jazzman1 Posted May 29, 2013 Share Posted May 29, 2013 (edited) Is there something common among all duplicates something unique? Edited May 29, 2013 by jazzman1 Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013 (Author)

Nope. The street column is a consolidation of zip code, city, street name and street number, and the status column holds 12 or 13 different statuses. Usually I would expect the address to be unique, but when I randomly checked for duplicates I found that 5% of the full data set contained duplicates. That's why I need a conditional deletion.
Barand Posted May 29, 2013

Your example used commas as separators in the CSV; you are now using semicolons.
mikosiko Posted May 29, 2013

... and with everything enclosed in ", the whole line parses as a single field....
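[Editor's note: given the two points above, the fix is to tell PHP's CSV functions about the semicolon delimiter and the quote enclosure. A minimal sketch - the sample line is taken from the thread, and str_getcsv() (PHP 5.3+) is used here only because it parses a string the same way fgetcsv() parses a file line:]

```php
<?php
// One line from the real file, semicolon-separated.
$line = '2300 København S Englandsvej 160;Unknown;';

// The second and third arguments set the delimiter and enclosure.
$fields = str_getcsv($line, ';', '"');
// $fields is now array('2300 København S Englandsvej 160', 'Unknown', '')

// In the file-based script above, the equivalent change would be:
//   while ($line = fgetcsv($fp, 1024, ';', '"')) { ... }
//   fputcsv($fp, $line, ';', '"');
```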
PatrickPetersen Posted May 29, 2013 (Author)

Oh god, it works! You just saved my thesis! Thank you SO much!!