Jump to content

Find Duplicate Entries in Table


Recommended Posts

I have an Oracle Database of Movies.  I need to find duplicate "titles" using a SQL Query.  Every Title has a unique Title_ID, but there are some duplicates Programme_Names.  I need to find the duplicates in order to merge them in the actual database through the merge feature in there.  I don't know how to ask for duplicate records in a column.  Also, there could be slight differences in the titles.  One could have been input as "The Darkling" and the other as "Darkling".  Any input would be greatly appreciated.

 

Link to comment
Share on other sites

When I run my query I get over 8000 rows.  For the titles beginning with "A", I get 546 titles (rows).  There are 12 possible duplicates.

 

This is the results I want to see. 

 

TITLE_ID PROGRAMME_NAME   YEAR OF PRODUCTION

59055 Abandoned             1949

81474 Abandoned, The             2006

16912 Ablaze                             0

21468 Ablaze                           2001

18971 Addams Family             1991

15668 Addams' Family, The 1991

19267 After Dark, My Sweet 1990

78865 After Dark, My Sweet 1990

18786 Airplane                             0

17740 Airplane!                         1980

22643 Alvin & The Chipmunks: Meet The Wolfman 2000

22479 Alvin And The Chipmunks Meet The Wolfman 2000

 

I don't know how to define the Programme_Name in the where clause in order to get only possible duplicates. 

 

Link to comment
Share on other sites

This is only half a matter of programming/knowing SQL.  The other half is dealing with the problem of language and differences.  There are many different types of differences in the titles you have listed there.  You could probably divine some rules from that alone, but you might be missing others and you might also incorrectly match two different movies as the same one.

 

Based on what you have there you might consider something like removing punctuation, the words "and" and "the", perhaps changing Arabic numbers to their "word" counterpart (1 -> "one")....  and take into account the year of production in cases in which a year is known.  These are some of the things you'll have to look at.  To actually do this would at least require a transitional table (a "scratch" table) to keep your work in....

 

Not too easy.

Link to comment
Share on other sites

I just tried the query on the movies listed and it lists all but the Addams family ones

 

[pre]

TITLE                                          SOUNDEX

 

'Abandoned',                                  'A15353'

'Abandoned, The',                              'A15353'

'Ablaze',                                      'A142'

'Ablaze',                                      'A142'

'Addams Family',                              'A352154'            <--

'Addams's Family, The',                        'A3521543'          <--

'After Dark, My Sweet',                        'A136362523'

'After Dark, My Sweet',                        'A136362523'

'Airplane',                                    'A6145'

'Airplane!',                                  'A6145'

'Alvin & The Chipmunks: Meet The Wolfman',    'A4153215253415'

'Alvin And The Chipmunks Meet The Wolfman ',  'A4153215253415'

[/pre]

Link to comment
Share on other sites

When I try to run the following:

 

select group_concat(programme_name SEPARATOR '\n') as title, count(*), title_id, year_of_production as occurs from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

 

I get ORA-00907: missing right parenthesis

 

I don't see where I am missing a right parenthesis.

 

Also, in the other example of the code, what does the following mean?

 

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

Link to comment
Share on other sites

The query is designed for MySQL and not Oracle - the error described above is an Oracle error.

 

The `as occurs` should following `count(*)` as an alias for what every the count value is to be used in the HAVING clause.

 

There are four function calls in play with the following (starting from the inner most and using the programme_name example `Alvin And The Chipmunks Meet The Wolfman`):

 

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

 

lower('Alvin And The Chipmunks Meet The Wolfman') converts the string to lower case resulting in 'alvin and the chipmunks meet the wolfman'

 

replace('alvin and the chipmunks meet the wolfman', ' the', '') replaces all occurances of ' the' with an empty string and results in 'alvin and chipmunks meet wolfman'

 

replace('alvin and chipmunks meet wolfman') replaces all occurances of ' and' with an empty string and results in 'alvin chipmunks meet wolfman'

 

soundex('alvin and chipmunks meet wolfman') returns the string into a soundex (sounds like) character string that ignores not alphabtic characters - words like pair and pare return the same string and mayor and mare also return the same string.

Link to comment
Share on other sites

I apologize for being so green at this.  I am using TOAD to run this query.  The database is Oracle.  I can't get around the "missing right parenthesis" error.  I fixed the "as occurs" problem (I knew that, just wasn't paying attention).

 

select group_concat(programme_name SEPARATOR '\n') as title, count(*) as occurs, title_id, year_of_production

from title

GROUP BY SOUNDEX(programme_name)

HAVING occurs > 1

 

So the replace is just to get "Like" titles, it doesn't do anything in the database.

Link to comment
Share on other sites

I changed the query to:

 

SELECT title_id, year_of_production, group_concat(programme_name,'\\n'), count(programme_name) AS occurs

FROM title

GROUP BY SOUNDEX(replace(replace(lower(programme_name), ' the', ''), ' and', ''))

HAVING count(programme_name) > 1

 

Now I get an Oracle error message of ORA-00904: "GROUP_CONCAT": invalid identifier.  Why does it think this is a column name? 

Link to comment
Share on other sites

I have spent 3 days working on this exclusively.  I believe you are right.

 

When I do this:

 

select year_of_production,programme_name, count(programme_name)

FROM (select title_id, programme_name, year_of_production from title where title_type = 'P')

GROUP BY programme_name, year_of_production

HAVING count(*) > 1

 

I get:

 

YEAR PROGRAMME_NAME COUNT

2004 Dirt             2

1999 Pups             2

2001 Torso             2

1991 Switch             2

1976 Network             2

1978 Wiz, The             2

1977 Slap Shot 2

1985 Turk 182!             2

2001 Wash, The 2

1988 Boost, The 2

0 Last Dance 2

1988 Masquerade 2

 

But that doesn't account for the titles that have slight differences.  I also need Titles that have different Year of production (sometimes it is entered incorrectly).  Any ideas you have would be greatly appreciated.

Link to comment
Share on other sites

I'd try something like this. I hope the SQL is standard enough, I only have MySql.

 

<form method='post'>
<table border='1'>
     <tr>
        <td>Delete
        </td>
        <td>
            Titles
        </td>
        <td>
            Year
        </td>
     </tr>

<?php
include '../test/db2.php';

$sql = "SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,
        b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb
        FROM movies a INNER JOIN movies b
        ON ((SOUNDEX(a.programme_name) LIKE CONCAT(SOUNDEX(b.programme_name),'%'))
         OR (SOUNDEX(b.programme_name) LIKE CONCAT(SOUNDEX(a.programme_name),'%')))
        WHERE a.title_id < b.title_id";

$res = mysql_query($sql) or die (mysql_error()."<p>$sql</p>");

while (list($ida, $na, $ya, $idb, $nb, $yb) = mysql_fetch_row($res)) {

echo <<<ROW
     <tr>
        <td><input type='checkbox' name='del_id[]' value='$ida'><br>
            <input type='checkbox' name='del_id[]' value='$idb'>
        </td>
        <td>
            $na<br>$nb
        </td>
        <td>
            $ya<br>$yb
        </td>
     </tr>
ROW;
}
?>

</table>
<input type='submit' name='action' value='Delete selected'>
</form>

Link to comment
Share on other sites

I don't want to delete the titles.  Some duplicates should be there (they are remakes or a TV Versions vs Theater Version).  I just want to find them to determine if they are actually duplicates that need to be merged.  There is a merge feature in the actual database.

 

 

Link to comment
Share on other sites

That returns over 38000 rows.  It returns things like:  Man With One Red Shoe = Manhattan

 

Man With One Red Shoe matches 27 other titles that are really nothing like it.  I do understand why it is returning them but I am really looking for something that matches more than one word in each title.  Is there any other way to do this?

Link to comment
Share on other sites

Try this simpler one and see how it goes as a first pass

 

SELECT a.title_id as ida, a.programme_name as titlea, a.year_of_production as yeara,

        b.title_id as idb, b.programme_name as titleb, b.year_of_production as yearb

        FROM movies a INNER JOIN movies b

        ON SOUNDEX(a.programme_name = b.programme_name)

        WHERE a.title_id < b.title_id

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.