dstaton Posted May 3, 2007 Share Posted May 3, 2007 I have an Oracle Database of Movies. I need to find duplicate "titles" using a SQL Query. Every Title has a unique Title_ID, but there are some duplicates Programme_Names. I need to find the duplicates in order to merge them in the actual database through the merge feature in there. I don't know how to ask for duplicate records in a column. Also, there could be slight differences in the titles. One could have been input as "The Darkling" and the other as "Darkling". Any input would be greatly appreciated. Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 3, 2007 Author Share Posted May 3, 2007 When I run my query I get over 8000 rows. For the titles beginning with "A", I get 546 titles (rows). There are 12 possible duplicates. This is the results I want to see. TITLE_ID PROGRAMME_NAME YEAR OF PRODUCTION 59055 Abandoned 1949 81474 Abandoned, The 2006 16912 Ablaze 0 21468 Ablaze 2001 18971 Addams Family 1991 15668 Addams' Family, The 1991 19267 After Dark, My Sweet 1990 78865 After Dark, My Sweet 1990 18786 Airplane 0 17740 Airplane! 1980 22643 Alvin & The Chipmunks: Meet The Wolfman 2000 22479 Alvin And The Chipmunks Meet The Wolfman 2000 I don't know how to define the Programme_Name in the where clause in order to get only possible duplicates. Quote Link to comment Share on other sites More sharing options...
fenway Posted May 3, 2007 Share Posted May 3, 2007 You'll have to come up with "rules" to define what you mean by "duplicates". Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 3, 2007 Author Share Posted May 3, 2007 That's the problem. I am very new to this. I don't know how to define the rules. Quote Link to comment Share on other sites More sharing options...
fenway Posted May 3, 2007 Share Posted May 3, 2007 Neither do -- what do you mean by duplicates? Quote Link to comment Share on other sites More sharing options...
Wildbug Posted May 3, 2007 Share Posted May 3, 2007 This is only half a matter of programming/knowing SQL. The other half is dealing with the problem of language and differences. There are many different types of differences in the titles you have listed there. You could probably divine some rules from that alone, but you might be missing others and you might also incorrectly match two different movies as the same one. Based on what you have there you might consider something like removing punctuation, the words "and" and "the", perhaps changing Arabic numbers to their "word" counterpart (1 -> "one").... and take into account the year of production in cases in which a year is known. These are some of the things you'll have to look at. To actually do this would at least require a transitional table (a "scratch" table) to keep your work in.... Not too easy. Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 4, 2007 Share Posted May 4, 2007 Does this return more title duplicates than just similar titles? SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies GROUP BY SOUNDEX(programme_name) HAVING occurs > 1 Quote Link to comment Share on other sites More sharing options...
Barand Posted May 4, 2007 Share Posted May 4, 2007 I just tried the query on the movies listed and it lists all but the Addams family ones [pre] TITLE SOUNDEX 'Abandoned', 'A15353' 'Abandoned, The', 'A15353' 'Ablaze', 'A142' 'Ablaze', 'A142' 'Addams Family', 'A352154' <-- 'Addams's Family, The', 'A3521543' <-- 'After Dark, My Sweet', 'A136362523' 'After Dark, My Sweet', 'A136362523' 'Airplane', 'A6145' 'Airplane!', 'A6145' 'Alvin & The Chipmunks: Meet The Wolfman', 'A4153215253415' 'Alvin And The Chipmunks Meet The Wolfman ', 'A4153215253415' [/pre] Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 4, 2007 Share Posted May 4, 2007 Yeah, I know... I've also tried the query on the list provided... but I am wondering how many false positives (if any) are returned from the entire 8000. The soundex on the Addams family string is still very close. Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 4, 2007 Share Posted May 4, 2007 dstation, how does the following query do on all 8000 records? SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', '')) HAVING occurs > 1 Quote Link to comment Share on other sites More sharing options...
Wildbug Posted May 4, 2007 Share Posted May 4, 2007 You should include year, too, accounting for remakes and different movies with the same title. Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 4, 2007 Author Share Posted May 4, 2007 When I try to run the following: select group_concat(programme_name SEPARATOR '\n') as title, count(*), title_id, year_of_production as occurs from title GROUP BY SOUNDEX(programme_name) HAVING occurs > 1 I get ORA-00907: missing right parenthesis I don't see where I am missing a right parenthesis. Also, in the other example of the code, what does the following mean? GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', '')) Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 5, 2007 Share Posted May 5, 2007 The query is designed for MySQL and not Oracle - the error described above is an Oracle error. The `as occurs` should following `count(*)` as an alias for what every the count value is to be used in the HAVING clause. There are four function calls in play with the following (starting from the inner most and using the programme_name example `Alvin And The Chipmunks Meet The Wolfman`): GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', '')) lower('Alvin And The Chipmunks Meet The Wolfman') converts the string to lower case resulting in 'alvin and the chipmunks meet the wolfman' replace('alvin and the chipmunks meet the wolfman', ' the', '') replaces all occurances of ' the' with an empty string and results in 'alvin and chipmunks meet wolfman' replace('alvin and chipmunks meet wolfman') replaces all occurances of ' and' with an empty string and results in 'alvin chipmunks meet wolfman' soundex('alvin and chipmunks meet wolfman') returns the string into a soundex (sounds like) character string that ignores not alphabtic characters - words like pair and pare return the same string and mayor and mare also return the same string. Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 7, 2007 Author Share Posted May 7, 2007 I apologize for being so green at this. I am using TOAD to run this query. The database is Oracle. I can't get around the "missing right parenthesis" error. I fixed the "as occurs" problem (I knew that, just wasn't paying attention). select group_concat(programme_name SEPARATOR '\n') as title, count(*) as occurs, title_id, year_of_production from title GROUP BY SOUNDEX(programme_name) HAVING occurs > 1 So the replace is just to get "Like" titles, it doesn't do anything in the database. Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 8, 2007 Share Posted May 8, 2007 Maybe escaping the newline will stop the error you get in TOAD.. and yeah, replace() only affects the copy of the string that is returned and not the actual stored data. Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 8, 2007 Author Share Posted May 8, 2007 What do you mean " escaping the newline? What does "SEPARATOR '\N'" do? Quote Link to comment Share on other sites More sharing options...
bubblegum.anarchy Posted May 9, 2007 Share Posted May 9, 2007 Seperates each value returned by the group by clause with a newline character \n - escape the newline character by with an additional backslash thus: \\n Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 9, 2007 Author Share Posted May 9, 2007 I changed the query to: SELECT title_id, year_of_production, group_concat(programme_name,'\\n'), count(programme_name) AS occurs FROM title GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', '')) HAVING count(programme_name) > 1 Now I get an Oracle error message of ORA-00904: "GROUP_CONCAT": invalid identifier. Why does it think this is a column name? Quote Link to comment Share on other sites More sharing options...
Barand Posted May 9, 2007 Share Posted May 9, 2007 I suspect GROUP_CONCAT is a mysql-only function Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 9, 2007 Author Share Posted May 9, 2007 I have spent 3 days working on this exclusively. I believe you are right. When I do this: select year_of_production,programme_name, count(programme_name) FROM (select title_id, programme_name, year_of_production from title where title_type = 'P') GROUP BY programme_name, year_of_production HAVING count(*) > 1 I get: YEAR PROGRAMME_NAME COUNT 2004 Dirt 2 1999 Pups 2 2001 Torso 2 1991 Switch 2 1976 Network 2 1978 Wiz, The 2 1977 Slap Shot 2 1985 Turk 182! 2 2001 Wash, The 2 1988 Boost, The 2 0 Last Dance 2 1988 Masquerade 2 But that doesn't account for the titles that have slight differences. I also need Titles that have different Year of production (sometimes it is entered incorrectly). Any ideas you have would be greatly appreciated. Quote Link to comment Share on other sites More sharing options...
Barand Posted May 9, 2007 Share Posted May 9, 2007 I'd try something like this. I hope the SQL is standard enough, I only have MySql. <form method='post'> <table border='1'> <tr> <td>Delete </td> <td> Titles </td> <td> Year </td> </tr> <?php include '../test/db2.php'; $sql = "SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara, b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb FROM movies a INNER JOIN movies b ON ((SOUNDEX(a.programme_name) LIKE CONCAT(SOUNDEX(b.programme_name),'%')) OR (SOUNDEX(b.programme_name) LIKE CONCAT(SOUNDEX(a.programme_name),'%'))) WHERE a.title_id < b.title_id"; $res = mysql_query($sql) or die (mysql_error()."<p>$sql</p>"); while (list($ida, $na, $ya, $idb, $nb, $yb) = mysql_fetch_row($res)) { echo <<<ROW <tr> <td><input type='checkbox' name='del_id[]' value='$ida'><br> <input type='checkbox' name='del_id[]' value='$idb'> </td> <td> $na<br>$nb </td> <td> $ya<br>$yb </td> </tr> ROW; } ?> </table> <input type='submit' name='action' value='Delete selected'> </form> Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 10, 2007 Author Share Posted May 10, 2007 I don't want to delete the titles. Some duplicates should be there (they are remakes or a TV Versions vs Theater Version). I just want to find them to determine if they are actually duplicates that need to be merged. There is a merge feature in the actual database. Quote Link to comment Share on other sites More sharing options...
Barand Posted May 10, 2007 Share Posted May 10, 2007 Change column heading to "Merge" and rename the button Quote Link to comment Share on other sites More sharing options...
dstaton Posted May 11, 2007 Author Share Posted May 11, 2007 That returns over 38000 rows. It returns things like: Man With One Red Shoe = Manhattan Man With One Red Shoe matches 27 other titles that are really nothing like it. I do understand why it is returning them but I am really looking for something that matches more than one word in each title. Is there any other way to do this? Quote Link to comment Share on other sites More sharing options...
Barand Posted May 11, 2007 Share Posted May 11, 2007 Try this simpler one and see how it goes as a first pass SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara, b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb FROM movies a INNER JOIN movies b ON SOUNDEX(a.programme_name = b.programme_name) WHERE a.title_id < b.title_id Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.