Find Duplicate Entries in Table

dstaton · May 3, 2007

I have an Oracle Database of Movies. I need to find duplicate "titles" using a SQL Query. Every Title has a unique Title_ID, but there are some duplicates Programme_Names. I need to find the duplicates in order to merge them in the actual database through the merge feature in there. I don't know how to ask for duplicate records in a column. Also, there could be slight differences in the titles. One could have been input as "The Darkling" and the other as "Darkling". Any input would be greatly appreciated.

dstaton · May 3, 2007

When I run my query I get over 8000 rows. For the titles beginning with "A", I get 546 titles (rows). There are 12 possible duplicates.

This is the results I want to see.

TITLE_ID PROGRAMME_NAME YEAR OF PRODUCTION

59055 Abandoned 1949

81474 Abandoned, The 2006

16912 Ablaze 0

21468 Ablaze 2001

18971 Addams Family 1991

15668 Addams' Family, The 1991

19267 After Dark, My Sweet 1990

78865 After Dark, My Sweet 1990

18786 Airplane 0

17740 Airplane! 1980

22643 Alvin & The Chipmunks: Meet The Wolfman 2000

22479 Alvin And The Chipmunks Meet The Wolfman 2000

I don't know how to define the Programme_Name in the where clause in order to get only possible duplicates.

fenway · May 3, 2007

You'll have to come up with "rules" to define what you mean by "duplicates".

dstaton · May 3, 2007

That's the problem. I am very new to this. I don't know how to define the rules.

fenway · May 3, 2007

Neither do -- what do you mean by duplicates?

Wildbug · May 3, 2007

This is only half a matter of programming/knowing SQL. The other half is dealing with the problem of language and differences. There are many different types of differences in the titles you have listed there. You could probably divine some rules from that alone, but you might be missing others and you might also incorrectly match two different movies as the same one.

Based on what you have there you might consider something like removing punctuation, the words "and" and "the", perhaps changing Arabic numbers to their "word" counterpart (1 -> "one").... and take into account the year of production in cases in which a year is known. These are some of the things you'll have to look at. To actually do this would at least require a transitional table (a "scratch" table) to keep your work in....

Not too easy.

bubblegum.anarchy · May 4, 2007

Does this return more title duplicates than just similar titles?

SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies
GROUP BY SOUNDEX(programme_name)
HAVING occurs > 1

Barand · May 4, 2007

I just tried the query on the movies listed and it lists all but the Addams family ones

[pre]

TITLE SOUNDEX

'Abandoned', 'A15353'

'Abandoned, The', 'A15353'

'Ablaze', 'A142'

'Addams Family', 'A352154' <--

'Addams's Family, The', 'A3521543' <--

'After Dark, My Sweet', 'A136362523'

'Airplane', 'A6145'

'Airplane!', 'A6145'

'Alvin & The Chipmunks: Meet The Wolfman', 'A4153215253415'

'Alvin And The Chipmunks Meet The Wolfman ', 'A4153215253415'

[/pre]

bubblegum.anarchy · May 4, 2007

Yeah, I know... I've also tried the query on the list provided... but I am wondering how many false positives (if any) are returned from the entire 8000.

The soundex on the Addams family string is still very close.

bubblegum.anarchy · May 4, 2007

dstation, how does the following query do on all 8000 records?

SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies
GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))
HAVING occurs > 1

Wildbug · May 4, 2007

You should include year, too, accounting for remakes and different movies with the same title.

dstaton · May 4, 2007

When I try to run the following:

select group_concat(programme_name SEPARATOR '\n') as title, count(*), title_id, year_of_production as occurs from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

I get ORA-00907: missing right parenthesis

I don't see where I am missing a right parenthesis.

Also, in the other example of the code, what does the following mean?

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

bubblegum.anarchy · May 5, 2007

The query is designed for MySQL and not Oracle - the error described above is an Oracle error.

The `as occurs` should following `count(*)` as an alias for what every the count value is to be used in the HAVING clause.

There are four function calls in play with the following (starting from the inner most and using the programme_name example `Alvin And The Chipmunks Meet The Wolfman`):

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

lower('Alvin And The Chipmunks Meet The Wolfman') converts the string to lower case resulting in 'alvin and the chipmunks meet the wolfman'

replace('alvin and the chipmunks meet the wolfman', ' the', '') replaces all occurances of ' the' with an empty string and results in 'alvin and chipmunks meet wolfman'

replace('alvin and chipmunks meet wolfman') replaces all occurances of ' and' with an empty string and results in 'alvin chipmunks meet wolfman'

soundex('alvin and chipmunks meet wolfman') returns the string into a soundex (sounds like) character string that ignores not alphabtic characters - words like pair and pare return the same string and mayor and mare also return the same string.

dstaton · May 7, 2007

I apologize for being so green at this. I am using TOAD to run this query. The database is Oracle. I can't get around the "missing right parenthesis" error. I fixed the "as occurs" problem (I knew that, just wasn't paying attention).

select group_concat(programme_name SEPARATOR '\n') as title, count(*) as occurs, title_id, year_of_production

from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

So the replace is just to get "Like" titles, it doesn't do anything in the database.

bubblegum.anarchy · May 8, 2007

Maybe escaping the newline will stop the error you get in TOAD.. and yeah, replace() only affects the copy of the string that is returned and not the actual stored data.

dstaton · May 8, 2007

What do you mean " escaping the newline? What does "SEPARATOR '\N'" do?

bubblegum.anarchy · May 9, 2007

Seperates each value returned by the group by clause with a newline character \n - escape the newline character by with an additional backslash thus: \\n

dstaton · May 9, 2007

I changed the query to:

SELECT title_id, year_of_production, group_concat(programme_name,'\\n'), count(programme_name) AS occurs

FROM title

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

HAVING count(programme_name) > 1

Now I get an Oracle error message of ORA-00904: "GROUP_CONCAT": invalid identifier. Why does it think this is a column name?

Barand · May 9, 2007

I suspect GROUP_CONCAT is a mysql-only function

dstaton · May 9, 2007

I have spent 3 days working on this exclusively. I believe you are right.

When I do this:

select year_of_production,programme_name, count(programme_name)

FROM (select title_id, programme_name, year_of_production from title where title_type = 'P')

GROUP BY programme_name, year_of_production

HAVING count(*) > 1

I get:

YEAR PROGRAMME_NAME COUNT

2004 Dirt 2

1999 Pups 2

2001 Torso 2

1991 Switch 2

1976 Network 2

1978 Wiz, The 2

1977 Slap Shot 2

1985 Turk 182! 2

2001 Wash, The 2

1988 Boost, The 2

0 Last Dance 2

1988 Masquerade 2

But that doesn't account for the titles that have slight differences. I also need Titles that have different Year of production (sometimes it is entered incorrectly). Any ideas you have would be greatly appreciated.

Barand · May 9, 2007

I'd try something like this. I hope the SQL is standard enough, I only have MySql.

<form method='post'>
<table border='1'>
     <tr>
        <td>Delete
        </td>
        <td>
            Titles
        </td>
        <td>
            Year
        </td>
     </tr>

<?php
include '../test/db2.php';

$sql = "SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,
        b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb
        FROM movies a INNER JOIN movies b
        ON ((SOUNDEX(a.programme_name) LIKE CONCAT(SOUNDEX(b.programme_name),'%'))
         OR (SOUNDEX(b.programme_name) LIKE CONCAT(SOUNDEX(a.programme_name),'%')))
        WHERE a.title_id < b.title_id";

$res = mysql_query($sql) or die (mysql_error()."<p>$sql</p>");

while (list($ida, $na, $ya, $idb, $nb, $yb) = mysql_fetch_row($res)) {

echo <<<ROW
     <tr>
        <td><input type='checkbox' name='del_id[]' value='$ida'><br>
            <input type='checkbox' name='del_id[]' value='$idb'>
        </td>
        <td>
            $na<br>$nb
        </td>
        <td>
            $ya<br>$yb
        </td>
     </tr>
ROW;
}
?>

</table>
<input type='submit' name='action' value='Delete selected'>
</form>

dstaton · May 10, 2007

I don't want to delete the titles. Some duplicates should be there (they are remakes or a TV Versions vs Theater Version). I just want to find them to determine if they are actually duplicates that need to be merged. There is a merge feature in the actual database.

Barand · May 10, 2007

Change column heading to "Merge" and rename the button

dstaton · May 11, 2007

That returns over 38000 rows. It returns things like: Man With One Red Shoe = Manhattan

Man With One Red Shoe matches 27 other titles that are really nothing like it. I do understand why it is returning them but I am really looking for something that matches more than one word in each title. Is there any other way to do this?

Barand · May 11, 2007

Try this simpler one and see how it goes as a first pass

SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,

b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb

FROM movies a INNER JOIN movies b

ON SOUNDEX(a.programme_name = b.programme_name)

WHERE a.title_id < b.title_id

Sign In

Find Duplicate Entries in Table

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived

Important Information