Jump to content

Text Based Matching to Clean up Messy Data - A Challenge (ie. Question)


Recommended Posts

==============================

THE PROBLEM

==============================

I am currently dealing with lots of messy data in let's say a SQL database. Within this database there exist two entirely unique and silo'ed tables of account information. In both each account had a unique text based name as well as account number. The problem lies in trying merge the two into one ....

 

Where things get tricky is that the data for a single account often looks like this:

 

Table / Name / Number

 

Table 1 / Starbucks / 9373423

Table 1 / Starbucks Corporation / 123456

Table 1 / Starbucks, Inc. / 0091233

 

Table 2 / STAR BUCKS GROUP INC / 34234 (notice the space)

Table 2 / Starbucks, Corporation Inc. / 01393123

 

The point I make is that multiple account names exist as duplicates within their own table as well as within the separate table.

 

==============================

MY EFFORTS SO FAR

==============================

To date I have done this .... I wrote a script that removes all punctuation, makes everything lowercase, removes words like "Inc" "Group" and Corporation"

 

Then this script looks for an exact match. This works to some extent. Where it fails is merging the "STAR BUCKS" from table 2 with say the "Starbucks from table 1"

 

.... to tackle this challenge I did this, For any phrase that has multiple components (ie. STAR + BUCKS = two components) I search for account names that contain one or more of these components, then I rank weighting matches with a great percentage of the total components higher. This also helps to some extent ... but

 

Now you start getting matches for "STAR BUCKS" with a company like "StarAlliance, Inc" or "Bucks Hunting Goods"

 

==============================

YOUR THOUGHTS AND TIPS / SUGGESTIONS

==============================

Have you dealt with nasty data merging like this before? How did you handle it? Is there a standard method or procedure for text based cleansing of messy data.

 

Thank you very much for taking the time to read this post.

-A

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.