Jump to content

Text Based Matching to Clean up Messy Data - A Challenge (ie. Question)


axelheyst

Recommended Posts

==============================

THE PROBLEM

==============================

I am currently dealing with lots of messy data in let's say a SQL database. Within this database there exist two entirely unique and silo'ed tables of account information. In both each account had a unique text based name as well as account number. The problem lies in trying merge the two into one ....

 

Where things get tricky is that the data for a single account often looks like this:

 

Table / Name / Number

 

Table 1 / Starbucks / 9373423

Table 1 / Starbucks Corporation / 123456

Table 1 / Starbucks, Inc. / 0091233

 

Table 2 / STAR BUCKS GROUP INC / 34234 (notice the space)

Table 2 / Starbucks, Corporation Inc. / 01393123

 

The point I make is that multiple account names exist as duplicates within their own table as well as within the separate table.

 

==============================

MY EFFORTS SO FAR

==============================

To date I have done this .... I wrote a script that removes all punctuation, makes everything lowercase, removes words like "Inc" "Group" and Corporation"

 

Then this script looks for an exact match. This works to some extent. Where it fails is merging the "STAR BUCKS" from table 2 with say the "Starbucks from table 1"

 

.... to tackle this challenge I did this, For any phrase that has multiple components (ie. STAR + BUCKS = two components) I search for account names that contain one or more of these components, then I rank weighting matches with a great percentage of the total components higher. This also helps to some extent ... but

 

Now you start getting matches for "STAR BUCKS" with a company like "StarAlliance, Inc" or "Bucks Hunting Goods"

 

==============================

YOUR THOUGHTS AND TIPS / SUGGESTIONS

==============================

Have you dealt with nasty data merging like this before? How did you handle it? Is there a standard method or procedure for text based cleansing of messy data.

 

Thank you very much for taking the time to read this post.

-A

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.