tobeyt23 Posted February 26, 2015 Share Posted February 26, 2015 (edited) I have a table with addresses in them and need to find the duplicates. Some address will be like 1000 somewhere st or 1000 somewhere Street or 100 main street ste 100 or 100 main st Suite 100. How can i match them up? table would be like this: id address city state zip Thanks Edited February 26, 2015 by tobeyt23 Quote Link to comment Share on other sites More sharing options...
mac_gyver Posted February 26, 2015 Share Posted February 26, 2015 afaik, there's no built in way this could be done directly by the database. you would need to normalize the addresses, by removing punctuation, converting different white-space characters to a single type (a space), converting groups of multiple white-space characters to a single one, and then replacing abbreviations with the whole words. at that point, barring typo errors in the original data, duplicates would be identical. for the case of typo's, you could then use SOUNDEX() to find same sounding addresses (doesn't test numbers though.) you could also use a levenshtein function (not available in mysql, but could be created as a user written function) to find how many characters are different between two normalized addresses. this would take into account the numbers in the address, whereas SOUNDEX() only operates on alphabetical characters. Quote Link to comment Share on other sites More sharing options...
Barand Posted February 26, 2015 Share Posted February 26, 2015 This how you might use levenshtein. As in the last result you could get false positives but you could probably lessen the occurence by taking zip into account also. mysql> SELECT * FROM table1 ORDER BY address; +----+-------------------------+ | id | address | +----+-------------------------+ | 4 | 100 main st suite 100 | | 3 | 100 main street ste 100 | | 2 | 1000 somewhere St | | 1 | 1000 somewhere Street | | 5 | 150 Granby Avenue | | 6 | 206 Station Road | | 7 | 306 Station Road | +----+-------------------------+ $threshold = 6; // change to alter the sensitivity $results = array(); $sql = "SELECT id, address FROM table1 ORDER BY address"; $res = $db->query($sql); $prev = $previd = ''; while (list($id, $add) = $res->fetch_row()) { if (levenshtein($prev,$add) <= $threshold) { $results[] = array( $previd=>$prev, $id=>$add ); } $prev = $add; $previd = $id; } echo '<pre>',print_r($results, true),'</pre>'; Results Array ( [0] => Array ( [4] => 100 main st suite 100 [3] => 100 main street ste 100 ) [1] => Array ( [2] => 1000 somewhere St [1] => 1000 somewhere Street ) [2] => Array ( [6] => 206 Station Road [7] => 306 Station Road ) ) 1 Quote Link to comment Share on other sites More sharing options...
maxxd Posted February 27, 2015 Share Posted February 27, 2015 Random aside here - Barand, I'd never (as far as I can recall, anyway) heard of the levenshtein() function. Thank you - that's kinda awesome! Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.