PatrickPetersen Posted May 28, 2013

Hello - first of all, I have low-to-medium knowledge of PHP, but I haven't used it for almost 10 years, and I'm completely stuck here. I hope someone can help me before I lose my mind.

I have a CSV file with a structure like this:

street, status, project
Road Street 2, Possible, HyperGalaxy
Street Road 4, Ready, Galaxy
Road Street 2, Unknown, Unknown

I've tried importing it into Excel and writing a macro, but after 3-5 hours of processing my computer crashes and I have to start all over.

My goal is that "Road Street 2, Unknown, Unknown" is deleted because of its status - "Possible" is "better" than "Unknown" in this case. I have around 300,000 rows and I know there will be approx. 10,000 duplicated values (and therefore 5,000 unique ones).

Is there any way of deleting duplicates by a "condition"?
Barand Posted May 28, 2013

Backup your data first.

DELETE profile
FROM profile
     JOIN (
         SELECT street
         FROM profile
         GROUP BY street
         HAVING COUNT(*) > 1
     ) as dupes USING (street)
WHERE status = 'Unknown'
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's SQL above]

Hi - maybe I'm just getting stupid as time goes by, but isn't that a query against an SQL database? I'm working with a CSV file (I hadn't dealt with CSV "databases" before starting this survey).
mac_gyver Posted May 29, 2013

Here are some coding hints for doing this with PHP:

1) Use fgetcsv() for reading/parsing each line from the CSV file.

2) Since you need a count of each street address, read through all the data once, storing the street addresses in an array, then use array_count_values() to get the count of each street address.

3) Read through the data a second time to process it. For each street address, the array obtained in the step above tells you how many times that address exists in the data. You can combine that count with the status value to produce the logic needed to ignore/remove rows.
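[Editor's note: the three steps above can be sketched as follows. This is an illustration only, not tested against the real data - the filenames and the assumption that street is column 0 and status is column 1 come from the sample in the first post.]

```php
<?php
// Build a tiny sample file using the data from the first post.
file_put_contents('input.csv', implode("\n", array(
    'street, status, project',
    'Road Street 2, Possible, HyperGalaxy',
    'Street Road 4, Ready, Galaxy',
    'Road Street 2, Unknown, Unknown',
)));

// Pass 1: collect every street address and count occurrences.
$fp = fopen('input.csv', 'r');
$head = fgetcsv($fp, 1024);              // header line
$streets = array();
while ($line = fgetcsv($fp, 1024)) {
    $streets[] = trim($line[0]);
}
$counts = array_count_values($streets);

// Pass 2: rewind and copy every row, except duplicates whose status is 'Unknown'.
rewind($fp);
fgetcsv($fp, 1024);                      // skip the header again
$out = fopen('output.csv', 'w');
fputcsv($out, $head);
while ($line = fgetcsv($fp, 1024)) {
    if ($counts[trim($line[0])] > 1 && trim($line[1]) == 'Unknown') {
        continue;                        // drop the weaker duplicate
    }
    fputcsv($out, $line);
}
fclose($fp);
fclose($out);
```

Running this against the sample leaves the header plus the "Possible" and "Ready" rows in output.csv; the duplicate "Unknown" row is dropped.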
Barand Posted May 29, 2013

Hope this makes amends (at least you've seen the benefit of a database).

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024);      // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line;
}
fclose($fp);

//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street => $streetdata) {
    $dupe = count($streetdata) > 1;
    foreach ($streetdata as $line) {
        if ($dupe && trim($line[1]) == 'Unknown') {
            continue;
        } else {
            fputcsv($fp, $line);
        }
    }
}
fclose($fp);
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's script above]

This works! Now to a more difficult part: I have made a "ranking" of the statuses (which status weighs more than the others). How can it be implemented? My ranking is:

Ready
Possible
BBU - Possible
Pipe ready
BBU - Pipe ready
Established
Waiting for planning
Study
Not possible/Project delayed
Not possible
Not relevant
Unknown
BBU - Unknown

It's for my bachelor thesis, so I really do appreciate your help!
Solution: Barand Posted May 29, 2013

Needs a little adjustment to sort the data by rank for each street.

$fp = fopen('input.txt', 'r');
$data = array();
$head = fgetcsv($fp, 1024);      // header line
while ($line = fgetcsv($fp, 1024)) {
    $data[$line[0]][] = $line;
}
fclose($fp);

//
// write to file
//
$fp = fopen('output.txt', 'w');
fputcsv($fp, $head);
foreach ($data as $street => $streetdata) {
    // use a custom sort to sort by ranking
    usort($streetdata, 'sortrank');
    // we only want the one with the highest ranking
    $line = array_shift($streetdata);
    fputcsv($fp, $line);
}
fclose($fp);

function sortrank($a, $b)
{
    $rank = array(
        'Ready'                        => 1,
        'Possible'                     => 2,
        'BBU - Possible'               => 3,
        'Pipe ready'                   => 4,
        'BBU - Pipe ready'             => 5,
        'Established'                  => 6,
        'Waiting for planning'         => 7,
        'Study'                        => 8,
        'Not possible/Project delayed' => 9,
        'Not possible'                 => 10,
        'Not relevant'                 => 11,
        'Unknown'                      => 12,
        'BBU - Unknown'                => 13
    );
    return $rank[trim($a[1])] - $rank[trim($b[1])];
}
PatrickPetersen Posted May 29, 2013 Author Share Posted May 29, 2013 (edited) Needs a little adjustment to sort the data by rank for each street $fp = fopen('input.txt', 'r'); $data = array(); $head = fgetcsv($fp, 1024); // header line while ($line = fgetcsv($fp, 1024)) { $data[$line[0]][] = $line; } fclose($fp); // // write to file // $fp = fopen('output.txt', 'w'); fputcsv($fp, $head); foreach ($data as $street=>$streetdata) { // use a custom sort to sort by ranking usort($streetdata, 'sortrank'); // we only want the one with the highest ranking $line = array_shift($streetdata); fputcsv($fp, $line); } fclose($fp); function sortrank($a, $b) { $rank = array ( 'Ready' => 1, 'Possible' => 2, 'BBU - Possible' => 3, 'Pipe ready' => 4, 'BBU - Pipe ready' => 5, 'Established' => 6, 'Waiting for planning' => 7, 'Study' => 8, 'Not possible/Project delayed' => 9, 'Not possible' => 10, 'Not relevant' => 11, 'Unknown' => 12, 'BBU - Unknown' => 13 ); return $rank[trim($a[1])] - $rank[trim($b[1])]; } Cool! I've tried that now and it writes.. Just give me 2 minuttes and I'll run it Edited May 29, 2013 by PatrickPetersen Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013 (Author)

[quotes Barand's ranking script above]

It makes a file on my server, but the data seems unchanged. You can see my input at http://thesis.mesmerize.dk/input.csv, and after you have run the script at http://thesis.mesmerize.dk/input.php you get the output at http://thesis.mesmerize.dk/output.csv
Barand Posted May 29, 2013

Did you allow for the fact that my script opens "input.txt" and writes to "output.txt"?
PatrickPetersen Posted May 29, 2013 (Author; edited May 29, 2013)

You're thinking about the file permissions, right? I've checked all the boxes just in case (chmod 777), but it still gives the same result. A case from the csv/txt file:

2300 København S Englandsvej 160;Not possible/Project delayed;
2300 København S Englandsvej 160;Unknown;

It should remove "2300 København S Englandsvej 160;Unknown;" and keep "2300 København S Englandsvej 160;Not possible/Project delayed;".
jazzman1 Posted May 29, 2013

Does that happen only with strings containing special chars like "København", or not?
PatrickPetersen Posted May 29, 2013 (Author)

It's not only in lines with special chars.
jazzman1 Posted May 29, 2013 Share Posted May 29, 2013 (edited) If you want to remove every lines containg the word "Unknown" itself, the easiest way is to create a RegEx which pattern should match every line containing that word. Open up and read the .csv file and use preg_replace replacing all matched lines with null result. Your RegEx pattern should look something like: (according that link -http://thesis.mesmerize.dk/input.csv ) ^".*;Unknown"$ EDIT: If you are using some IDE like NetBeans, you can open the csv file, go to the replace tab and enter my regex string replacing all of them with null result. Edited May 29, 2013 by jazzman1 Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013

Unfortunately, that's not enough: it needs to weigh the values. Not all duplicates have, for example, "Unknown" as a status; it needs to meet the conditions in the ranking array. If the same address from the earlier example had "Ready" in one row and "Not relevant" in another, it should delete the "Not relevant" entry because "Ready" ranks higher than "Not relevant".
jazzman1 Posted May 29, 2013 Share Posted May 29, 2013 (edited) Is there something common among all duplicates something unique? Edited May 29, 2013 by jazzman1 Quote Link to comment Share on other sites More sharing options...
PatrickPetersen Posted May 29, 2013 (Author)

Nope. The street column is a consolidation of zip code, city, street name and street number, and the status column holds 12 or 13 different statuses. Usually I would expect the address to be unique, but when I randomly checked for duplicates I found that 5% of the full data set contained duplicates. That's why I need a conditional deletion.
Barand Posted May 29, 2013

Your example used commas as separators in the CSV; you are now using semicolons.
mikosiko Posted May 29, 2013

... and with everything enclosed in ", the whole line parses as a single field....
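[Editor's note: given the two points above, the fix is to tell PHP's CSV functions about the semicolon delimiter and the quote enclosure. A minimal sketch - the sample line is taken from the thread, and str_getcsv() (PHP 5.3+) is used here only because it parses a string the same way fgetcsv() parses a file line:]

```php
<?php
// One line from the real file, semicolon-separated.
$line = '2300 København S Englandsvej 160;Unknown;';

// The second and third arguments set the delimiter and enclosure.
$fields = str_getcsv($line, ';', '"');
// $fields is now array('2300 København S Englandsvej 160', 'Unknown', '')

// In the file-based script above, the equivalent change would be:
//   while ($line = fgetcsv($fp, 1024, ';', '"')) { ... }
//   fputcsv($fp, $line, ';', '"');
```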
PatrickPetersen Posted May 29, 2013 (Author)

Oh god, it works! You just saved my thesis! Thank you SO much!!