Jump to content

Archived

This topic is now archived and is closed to further replies.

dstaton

Find Duplicate Entries in Table

Recommended Posts

I have an Oracle Database of Movies.  I need to find duplicate "titles" using a SQL Query.  Every Title has a unique Title_ID, but there are some duplicates Programme_Names.  I need to find the duplicates in order to merge them in the actual database through the merge feature in there.  I don't know how to ask for duplicate records in a column.  Also, there could be slight differences in the titles.  One could have been input as "The Darkling" and the other as "Darkling".  Any input would be greatly appreciated.

 

Share this post


Link to post
Share on other sites

When I run my query I get over 8000 rows.  For the titles beginning with "A", I get 546 titles (rows).  There are 12 possible duplicates.

 

This is the results I want to see. 

 

TITLE_ID PROGRAMME_NAME   YEAR OF PRODUCTION

59055 Abandoned             1949

81474 Abandoned, The             2006

16912 Ablaze                             0

21468 Ablaze                           2001

18971 Addams Family             1991

15668 Addams' Family, The 1991

19267 After Dark, My Sweet 1990

78865 After Dark, My Sweet 1990

18786 Airplane                             0

17740 Airplane!                         1980

22643 Alvin & The Chipmunks: Meet The Wolfman 2000

22479 Alvin And The Chipmunks Meet The Wolfman 2000

 

I don't know how to define the Programme_Name in the where clause in order to get only possible duplicates. 

 

Share this post


Link to post
Share on other sites

You'll have to come up with "rules" to define what you mean by "duplicates".

Share this post


Link to post
Share on other sites

That's the problem.  I am very new to this.  I don't know how to define the rules.

Share this post


Link to post
Share on other sites

Neither do -- what do you mean by duplicates?

Share this post


Link to post
Share on other sites

This is only half a matter of programming/knowing SQL.  The other half is dealing with the problem of language and differences.  There are many different types of differences in the titles you have listed there.  You could probably divine some rules from that alone, but you might be missing others and you might also incorrectly match two different movies as the same one.

 

Based on what you have there you might consider something like removing punctuation, the words "and" and "the", perhaps changing Arabic numbers to their "word" counterpart (1 -> "one")....  and take into account the year of production in cases in which a year is known.  These are some of the things you'll have to look at.  To actually do this would at least require a transitional table (a "scratch" table) to keep your work in....

 

Not too easy.

Share this post


Link to post
Share on other sites

Does this return more title duplicates than just similar titles?

 

SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies
GROUP BY SOUNDEX(programme_name)
HAVING occurs > 1

Share this post


Link to post
Share on other sites

I just tried the query on the movies listed and it lists all but the Addams family ones

 

[pre]

TITLE                                          SOUNDEX

 

'Abandoned',                                  'A15353'

'Abandoned, The',                              'A15353'

'Ablaze',                                      'A142'

'Ablaze',                                      'A142'

'Addams Family',                              'A352154'            <--

'Addams's Family, The',                        'A3521543'          <--

'After Dark, My Sweet',                        'A136362523'

'After Dark, My Sweet',                        'A136362523'

'Airplane',                                    'A6145'

'Airplane!',                                  'A6145'

'Alvin & The Chipmunks: Meet The Wolfman',    'A4153215253415'

'Alvin And The Chipmunks Meet The Wolfman ',  'A4153215253415'

[/pre]

Share this post


Link to post
Share on other sites

Yeah, I know... I've also tried the query on the list provided... but I am wondering how many false positives (if any) are returned from the entire 8000.

 

The soundex on the Addams family string is still very close.

Share this post


Link to post
Share on other sites

dstation, how does the following query do on all 8000 records?

 

SELECT group_concat(programme_name SEPARATOR '\n') as title, count(*) AS occurs FROM movies
GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))
HAVING occurs > 1

Share this post


Link to post
Share on other sites

You should include year, too, accounting for remakes and different movies with the same title.

Share this post


Link to post
Share on other sites

When I try to run the following:

 

select group_concat(programme_name SEPARATOR '\n') as title, count(*), title_id, year_of_production as occurs from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

 

I get ORA-00907: missing right parenthesis

 

I don't see where I am missing a right parenthesis.

 

Also, in the other example of the code, what does the following mean?

 

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

Share this post


Link to post
Share on other sites

The query is designed for MySQL and not Oracle - the error described above is an Oracle error.

 

The `as occurs` should following `count(*)` as an alias for what every the count value is to be used in the HAVING clause.

 

There are four function calls in play with the following (starting from the inner most and using the programme_name example `Alvin And The Chipmunks Meet The Wolfman`):

 

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

 

lower('Alvin And The Chipmunks Meet The Wolfman') converts the string to lower case resulting in 'alvin and the chipmunks meet the wolfman'

 

replace('alvin and the chipmunks meet the wolfman', ' the', '') replaces all occurances of ' the' with an empty string and results in 'alvin and chipmunks meet wolfman'

 

replace('alvin and chipmunks meet wolfman') replaces all occurances of ' and' with an empty string and results in 'alvin chipmunks meet wolfman'

 

soundex('alvin and chipmunks meet wolfman') returns the string into a soundex (sounds like) character string that ignores not alphabtic characters - words like pair and pare return the same string and mayor and mare also return the same string.

Share this post


Link to post
Share on other sites

I apologize for being so green at this.  I am using TOAD to run this query.  The database is Oracle.  I can't get around the "missing right parenthesis" error.  I fixed the "as occurs" problem (I knew that, just wasn't paying attention).

 

select group_concat(programme_name SEPARATOR '\n') as title, count(*) as occurs, title_id, year_of_production

from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

 

So the replace is just to get "Like" titles, it doesn't do anything in the database.

Share this post


Link to post
Share on other sites

Maybe escaping the newline will stop the error you get in TOAD.. and yeah, replace() only affects the copy of the string that is returned and not the actual stored data.

Share this post


Link to post
Share on other sites

What do you mean " escaping the newline?  What does "SEPARATOR '\N'" do? 

Share this post


Link to post
Share on other sites

Seperates each value returned by the group by clause with a newline character \n - escape the newline character by with an additional backslash thus: \\n

Share this post


Link to post
Share on other sites

I changed the query to:

 

SELECT title_id, year_of_production, group_concat(programme_name,'\\n'), count(programme_name) AS occurs

FROM title

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

HAVING count(programme_name) > 1

 

Now I get an Oracle error message of ORA-00904: "GROUP_CONCAT": invalid identifier.  Why does it think this is a column name? 

Share this post


Link to post
Share on other sites

I suspect GROUP_CONCAT is a mysql-only function

Share this post


Link to post
Share on other sites

I have spent 3 days working on this exclusively.  I believe you are right.

 

When I do this:

 

select year_of_production,programme_name, count(programme_name)

FROM (select title_id, programme_name, year_of_production from title where title_type = 'P')

GROUP BY programme_name, year_of_production

HAVING count(*) > 1

 

I get:

 

YEAR PROGRAMME_NAME COUNT

2004 Dirt             2

1999 Pups             2

2001 Torso             2

1991 Switch             2

1976 Network             2

1978 Wiz, The             2

1977 Slap Shot 2

1985 Turk 182!             2

2001 Wash, The 2

1988 Boost, The 2

0 Last Dance 2

1988 Masquerade 2

 

But that doesn't account for the titles that have slight differences.  I also need Titles that have different Year of production (sometimes it is entered incorrectly).  Any ideas you have would be greatly appreciated.

Share this post


Link to post
Share on other sites

I'd try something like this. I hope the SQL is standard enough, I only have MySql.

 

<form method='post'>
<table border='1'>
     <tr>
        <td>Delete
        </td>
        <td>
            Titles
        </td>
        <td>
            Year
        </td>
     </tr>

<?php
include '../test/db2.php';

$sql = "SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,
        b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb
        FROM movies a INNER JOIN movies b
        ON ((SOUNDEX(a.programme_name) LIKE CONCAT(SOUNDEX(b.programme_name),'%'))
         OR (SOUNDEX(b.programme_name) LIKE CONCAT(SOUNDEX(a.programme_name),'%')))
        WHERE a.title_id < b.title_id";

$res = mysql_query($sql) or die (mysql_error()."<p>$sql</p>");

while (list($ida, $na, $ya, $idb, $nb, $yb) = mysql_fetch_row($res)) {

echo <<<ROW
     <tr>
        <td><input type='checkbox' name='del_id[]' value='$ida'><br>
            <input type='checkbox' name='del_id[]' value='$idb'>
        </td>
        <td>
            $na<br>$nb
        </td>
        <td>
            $ya<br>$yb
        </td>
     </tr>
ROW;
}
?>

</table>
<input type='submit' name='action' value='Delete selected'>
</form>

Share this post


Link to post
Share on other sites

I don't want to delete the titles.  Some duplicates should be there (they are remakes or a TV Versions vs Theater Version).  I just want to find them to determine if they are actually duplicates that need to be merged.  There is a merge feature in the actual database.

 

 

Share this post


Link to post
Share on other sites

Change column heading to "Merge" and rename the button

Share this post


Link to post
Share on other sites

That returns over 38000 rows.  It returns things like:  Man With One Red Shoe = Manhattan

 

Man With One Red Shoe matches 27 other titles that are really nothing like it.  I do understand why it is returning them but I am really looking for something that matches more than one word in each title.  Is there any other way to do this?

Share this post


Link to post
Share on other sites

Try this simpler one and see how it goes as a first pass

 

SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,

        b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb

        FROM movies a INNER JOIN movies b

        ON SOUNDEX(a.programme_name = b.programme_name)

        WHERE a.title_id < b.title_id

Share this post


Link to post
Share on other sites

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.