Jump to content

How to find duplicates


tobeyt23

Recommended Posts

I have a table with addresses in them and need to find the duplicates. Some address will be like 1000 somewhere st or 1000 somewhere Street or 100 main street ste 100 or 100 main st Suite 100. How can i match them up?

 

table would be like this:

 

id

address

city

state

zip

 

Thanks

Edited by tobeyt23
Link to comment
Share on other sites

afaik, there's no built in way this could be done directly by the database.

 

you would need to normalize the addresses, by removing punctuation, converting different white-space characters to a single type (a space), converting groups of multiple white-space characters to a single one, and then replacing abbreviations with the whole words. at that point, barring typo errors in the original data, duplicates would be identical. for the case of typo's, you could then use SOUNDEX() to find same sounding addresses (doesn't test numbers though.)

 

you could also use a levenshtein function (not available in mysql, but could be created as a user written function) to find how many characters are different between two normalized addresses. this would take into account the numbers in the address, whereas SOUNDEX() only operates on alphabetical characters.

Link to comment
Share on other sites

This how you might use levenshtein. As in the last result you could get false positives but you could probably lessen the occurence by taking zip into account also.

mysql> SELECT * FROM table1 ORDER BY address;
+----+-------------------------+
| id | address                 |
+----+-------------------------+
|  4 | 100 main st suite 100   |
|  3 | 100 main street ste 100 |
|  2 | 1000 somewhere St       |
|  1 | 1000 somewhere Street   |
|  5 | 150 Granby Avenue       |
|  6 | 206 Station Road        |
|  7 | 306 Station Road        |
+----+-------------------------+
$threshold = 6;                 // change to alter the sensitivity
$results = array();

$sql = "SELECT id, address
        FROM table1
        ORDER BY address";
$res = $db->query($sql);
$prev = $previd = '';
while (list($id, $add) = $res->fetch_row()) {
    if (levenshtein($prev,$add) <= $threshold) {
        $results[] = array(
                        $previd=>$prev,
                        $id=>$add
                    );
    }
    $prev = $add;
    $previd = $id;
}
echo '<pre>',print_r($results, true),'</pre>';

Results

Array
(
    [0] => Array
        (
            [4] => 100 main st suite 100
            [3] => 100 main street ste 100
        )

    [1] => Array
        (
            [2] => 1000 somewhere St
            [1] => 1000 somewhere Street
        )

    [2] => Array
        (
            [6] => 206 Station Road
            [7] => 306 Station Road
        )

)
  • Like 1
Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.