
Find and delete duplicates under the right condition


Solved by Barand

Hello - First of all, I have low-to-medium knowledge of PHP, but I haven't used it for almost 10 years - and I'm completely stuck here!

 

I hope someone can help me; I'm pretty sure I'll lose my mind soon..

 

I have a CSV-file that has a structure like this:

 

    street, status, project

    Road Street 2, Possible, HyperGalaxy

    Street Road 4, Ready, Galaxy

    Road Street 2, Unknown, Unknown

 

I've tried importing it into Excel and writing a macro, but after 3-5 hours of processing my computer crashes and I have to start all over.

 

My goal is that the row "Road Street 2, Unknown, Unknown" is deleted because of its status - "Possible" is "better" than "Unknown" in this case..

 

I have around 300,000 rows and I know there will be approx. 10,000 duplicated values (and therefore 5,000 unique)

 

Is there any way of deleting duplicates by a "condition"?

 

Back up your data first

DELETE profile
FROM profile
JOIN (
    -- streets that occur more than once
    SELECT street FROM profile
    GROUP BY street
    HAVING COUNT(*) > 1
) as dupes USING (street)
WHERE status = 'Unknown'  -- of those, delete only the 'Unknown' rows
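In plain terms: the subquery finds every street that occurs more than once, and of those rows only the ones whose status is 'Unknown' are deleted, so the better duplicate survives.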

Hi - Maybe I'm just getting slower as time goes by.. But isn't that more of a query against an SQL database? (I hadn't worked with CSV databases before making this survey) :(


Here are some coding hints for doing this using PHP code -

 

1) Use fgetcsv() for reading/parsing each line from the CSV file.

 

2) Since you need a count of each street address, you would need to read through all the data once, storing the street addresses in an array, then use array_count_values() to get the count of each street address.

 

3) Read through the data a second time to process it. For each street address, the array obtained in the above step tells you how many times that address exists in the data. You can combine that count with the status value to produce the logic needed to ignore/remove rows - see the sketch below.
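A minimal sketch of those hints (the file names input.csv/output.csv and the comma-separated layout from the first post are assumptions):

$fp = fopen('input.csv', 'r');
$head = fgetcsv($fp, 1024); // header line
$streets = array();
while ($row = fgetcsv($fp, 1024)) {
    $streets[] = $row[0]; // collect every street address
}
fclose($fp);
$counts = array_count_values($streets); // street => occurrence count

$in  = fopen('input.csv', 'r');
$out = fopen('output.csv', 'w');
fgetcsv($in, 1024); // skip the header row on the second pass
fputcsv($out, $head);
while ($row = fgetcsv($in, 1024)) {
    if ($counts[$row[0]] > 1 && trim($row[1]) == 'Unknown') {
        continue; // duplicated street carrying the worse status
    }
    fputcsv($out, $row);
}
fclose($in);
fclose($out);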


Hope this makes amends (at least you've seen the benefit of a database)

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024); // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line; // group the rows by street (column 0)
}
fclose($fp);
//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street=>$streetdata) {
    $dupe = count($streetdata) > 1; // street occurs more than once?
    foreach ($streetdata as $line) {
        if ($dupe && trim($line[1])=='Unknown') {
            continue; // drop the duplicate with status 'Unknown'
        } else {
            fputcsv($fp, $line);
        }
    }
}
fclose($fp);
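A note on the approach: grouping all rows by street in one pass keeps the duplicate check trivial, and with around 300,000 short rows the in-memory array should be no problem.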

 


This works! :D

 

Now to a more difficult part.. I have made a "ranking" of the statuses (which status weighs more than the others)

 

How can it be implemented?

 

My ranking is:

 

    Ready
    Possible
    BBU - Possible
    Pipe ready
    BBU - Pipe ready
    Established
    Waiting for planning
    Study
    Not possible/Project delayed
    Not possible
    Not relevant
    Unknown
    BBU - Unknown

 

It's for my bachelor thesis so I really do appreciate your help! :)


  • Solution

Needs a little adjustment to sort the data by rank for each street

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024); // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line; // group the rows by street (column 0)
}
fclose($fp);
//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street=>$streetdata) {
    // use a custom sort to sort by ranking
    usort($streetdata, 'sortrank');
    // we only want the one with the highest ranking
    $line = array_shift($streetdata);
    fputcsv($fp, $line);
}
fclose($fp);

function sortrank($a, $b)
{
    // lower number = better status
    $rank = array (
            'Ready'                         => 1,
            'Possible'                      => 2,
            'BBU - Possible'                => 3,
            'Pipe ready'                    => 4,
            'BBU - Pipe ready'              => 5,
            'Established'                   => 6,
            'Waiting for planning'          => 7,
            'Study'                         => 8,
            'Not possible/Project delayed'  => 9,
            'Not possible'                  => 10,
            'Not relevant'                  => 11,
            'Unknown'                       => 12,
            'BBU - Unknown'                 => 13
            );

    // compare two rows by the rank of their status (column 1)
    return $rank[trim($a[1])] - $rank[trim($b[1])];
}
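The comparator returns a negative number when $a's status outranks $b's, so after usort() the best-ranked row sits first in each group and array_shift() keeps only that one per street.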

 


Cool! I've tried that now and it writes.. Just give me 2 minutes and I'll run it


 


It makes a file on my server, but the data seems unchanged :(

 

You can see my input.csv at http://thesis.mesmerize.dk/input.csv

and after you have run the script at http://thesis.mesmerize.dk/input.php you get the output at http://thesis.mesmerize.dk/output.csv 


Did you allow for my script opening "input.txt" and writing to "output.txt"?

 

You're thinking about file permissions, right? I've checked all the boxes just in case (chmod 777), but it still gives the same result..

 

 
A case from the csv/txt-file:
 
2300 København S Englandsvej 160;Not possible/Project delayed;
2300 København S Englandsvej 160;Unknown;
 
It should remove
"2300 København S Englandsvej 160;Unknown;"
and keep
"2300 København S Englandsvej 160;Not possible/Project delayed;"

If you want to remove every line containing the word "Unknown", the easiest way is to create a regex whose pattern matches every line containing that word.

 

Open up and read the .csv file, then use preg_replace() to replace all matched lines with an empty result.

 

Your regex pattern should look something like this (according to that link - http://thesis.mesmerize.dk/input.csv ):

^".*;Unknown"$

EDIT: If you are using an IDE like NetBeans, you can open the csv file, go to the Replace tab and enter my regex string, replacing all matches with an empty result.


Unfortunately, that's not enough.. It needs to weigh the value - not all duplicates have, for example, Unknown as their status. It needs to meet the conditions in the ranking array.

 

If the same address from the example before had "Ready" in the first entry and "Not relevant" in the second, it should delete the "Not relevant" entry because "Ready" is ranked higher than "Not relevant" :)


Nope, the street column is a consolidation of zip code, city, street name and street number.

And the status column holds one of 12 or 13 different statuses.

 

Usually I would think the address could have been unique, but when I randomly checked for duplicates I found that 5% of the full data set contained duplicates.

 

That's why I need a conditional deletion. :)

