axelheyst Posted January 7, 2008 Share Posted January 7, 2008 ============================== THE PROBLEM ============================== I am currently dealing with lots of messy data in let's say a SQL database. Within this database there exist two entirely unique and silo'ed tables of account information. In both each account had a unique text based name as well as account number. The problem lies in trying merge the two into one .... Where things get tricky is that the data for a single account often looks like this: Table / Name / Number Table 1 / Starbucks / 9373423 Table 1 / Starbucks Corporation / 123456 Table 1 / Starbucks, Inc. / 0091233 Table 2 / STAR BUCKS GROUP INC / 34234 (notice the space) Table 2 / Starbucks, Corporation Inc. / 01393123 The point I make is that multiple account names exist as duplicates within their own table as well as within the separate table. ============================== MY EFFORTS SO FAR ============================== To date I have done this .... I wrote a script that removes all punctuation, makes everything lowercase, removes words like "Inc" "Group" and Corporation" Then this script looks for an exact match. This works to some extent. Where it fails is merging the "STAR BUCKS" from table 2 with say the "Starbucks from table 1" .... to tackle this challenge I did this, For any phrase that has multiple components (ie. STAR + BUCKS = two components) I search for account names that contain one or more of these components, then I rank weighting matches with a great percentage of the total components higher. This also helps to some extent ... but Now you start getting matches for "STAR BUCKS" with a company like "StarAlliance, Inc" or "Bucks Hunting Goods" ============================== YOUR THOUGHTS AND TIPS / SUGGESTIONS ============================== Have you dealt with nasty data merging like this before? How did you handle it? Is there a standard method or procedure for text based cleansing of messy data. Thank you very much for taking the time to read this post. -A Quote Link to comment https://forums.phpfreaks.com/topic/84921-text-based-matching-to-clean-up-messy-data-a-challenge-ie-question/ Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.