
Similarity search


JonasLAX


Dear all,

 

I have the following problem. I have a lot of data from different sources in the same table. In many cases there are duplicate entries, but with slightly different names, for example "Lord of the Rings" and "The Lord of the Rings". I assume it will be almost impossible to detect these records automatically as exact matches. What I would like to do instead is create a list of suspected duplicates, then manually go through this list and decide which entries are duplicates and which aren't.

 

The table currently has about 35,000 records, so the algorithm also has to be reasonably smart and use a technique more sophisticated than simple brute force. Can anybody point me in the direction I should be looking?

 

Thanks,

Jonas


You have a data quality issue, my friend. Your efforts need to be focused on cleaning up your data rather than trying to accept it as-is. That being said, the best you can do in your situation is write some queries with a LIKE clause.
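
A rough sketch of what that might look like, as a self-join that flags pairs where one title is contained in the other (the table name "titles" and columns "id"/"name" are placeholders for your actual schema):

<?php
// Rough sketch, not a tuned solution: flag pairs where one title is
// contained in the other. "titles", "id" and "name" are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$sql = "SELECT a.id AS id_a, a.name AS name_a,
               b.id AS id_b, b.name AS name_b
        FROM titles a
        JOIN titles b ON a.id < b.id
        WHERE a.name LIKE CONCAT('%', b.name, '%')
           OR b.name LIKE CONCAT('%', a.name, '%')";

foreach ($pdo->query($sql) as $row) {
    // Each row is a suspected duplicate pair for the manual review list.
    echo $row['name_a'] . ' <=> ' . $row['name_b'] . "\n";
}

Be warned: this compares every pair of rows, so on 35,000 records it will be slow, and it only catches containment cases like "Lord of the Rings" inside "The Lord of the Rings", not reorderings or misspellings.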


Hi dbo,

 

Thanks for your reply. There is definitely a data quality issue here. The problem is that the data is aggregated from a wide range of sources which lie beyond our control. Therefore, we are in the unfortunate position of having to find a way to cope with the current input and try to clean it up.

 

Any special suggestions/best practices on how to structure the LIKE clauses?


It's not going to be good. You'll have to build some sort of a ranking tool and you'll suffer huge performance impacts as a result. For example:

 

lord of the rings

the lord of the rings

rings of our lord

 

The third probably has nothing to do with the other two, but how do you rank that? You'll have to build some sort of matrix. Perhaps you get a certain number of "points" for having the same number of words, and then you lose points for each word added or removed. You could also score the number of words in the correct positions (also accounting for an offset, i.e. a missing "the"). You'll have to figure out what those priorities and weights are, but even then it's not going to be perfect.
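
A crude sketch of that kind of point system (the weights here are arbitrary placeholders you would tune against your own data):

<?php
// Crude word-overlap scorer; the point values are arbitrary placeholders.
function similarity_score($a, $b) {
    $wordsA = explode(' ', strtolower($a));
    $wordsB = explode(' ', strtolower($b));

    // Points for each word the two titles share, regardless of position.
    $score = count(array_intersect($wordsA, $wordsB)) * 10;

    // Lose points for the +/- in word count.
    $score -= abs(count($wordsA) - count($wordsB)) * 5;

    // Bonus for words sitting in the same position (no offset handling).
    $len = min(count($wordsA), count($wordsB));
    for ($i = 0; $i < $len; $i++) {
        if ($wordsA[$i] === $wordsB[$i]) {
            $score += 3;
        }
    }
    return $score;
}

echo similarity_score('lord of the rings', 'the lord of the rings'), "\n"; // 35
echo similarity_score('lord of the rings', 'rings of our lord'), "\n";     // 33

Notice that with these naive weights the genuine match (35) only narrowly beats the unrelated title (33), which illustrates the point: without offset handling and properly tuned weights, the ranking stays fuzzy.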


  • 4 weeks later...

But I can't be the first one ever to bump into this problem?! There must be a nice solution for non-Googles, or not?

 

I hope you have found a solution to your data problem. I have a similar issue myself. What you are looking for is not PHP or web application specific; it has to do with the database and data quality. You would need to clean the data before using it.

 

A good database design would help prevent future data quality issues.

 

Cleaning Data:

You can use data integration tools or manual cleaning, depending on how much scrubbing is involved. Some data integration tools have fuzzy search components/modules to handle this type of issue.

 

You would most likely need one or more lookup tables.

 

Once your data is all cleaned and used in your application, you would need to ensure the data sources also follow certain rules when inputting data, or use only a single point of data entry. If you are not going to import data from these sources again, then you probably don't need to worry about this.

 

To clean the data, I would follow something like this:

1.) Make a lookup table for the data -- the table would contain the data in its correct, canonical form.

2.) You would also add several columns to this table that hold aliases: alias1, alias2,....

3.) For many situations using this de-normalized table structure might be OK.

4.) Cleaning has to be done by comparing each row from the table you need cleaned against the lookup table (using the main column or the aliases). You would need to filter the rows into 3 outputs -- one for exact matches, another for similar-but-not-quite matches (say the word order is different, or words are abbreviated, etc.), and another for non-matches (see the sketch after this list).

5.) This should greatly reduce the number of rows that need cleaning. Manually clean the data in the last two outputs -- partial matches and non-matches.
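
A rough sketch of step 4 in PHP, assuming the lookup table has been loaded into an array and that simple normalization plus a small edit-distance threshold count as "similar but not quite" (all names, sample data, and the threshold of 3 are hypothetical):

<?php
// Rough sketch of step 4: route each dirty row into one of three buckets.
// normalize(), the $lookup contents, and the threshold are placeholders.
function normalize($s) {
    $s = strtolower(trim($s));
    $s = preg_replace('/[^a-z0-9 ]/', '', $s); // strip punctuation
    return preg_replace('/\s+/', ' ', $s);     // collapse whitespace
}

// Lookup table: canonical form => known aliases (alias1, alias2, ...).
$lookup = array(
    'The Lord of the Rings' => array('lord of the rings', 'lotr'),
);

// Rows from the table that needs cleaning.
$dirtyRows = array('The Lord of the Rings', 'Lord of teh Rings', 'Rings of our Lord');

$exact = $partial = $nonMatches = array();

foreach ($dirtyRows as $row) {
    $norm   = normalize($row);
    $bucket = 'none';
    foreach ($lookup as $canonical => $aliases) {
        $candidates = array_map('normalize', array_merge(array($canonical), $aliases));
        if (in_array($norm, $candidates)) {
            $bucket  = 'exact';
            $exact[] = array($row, $canonical);    // output 1: exact match
            break;
        }
        foreach ($candidates as $candidate) {
            if (levenshtein($norm, $candidate) <= 3) {
                $bucket    = 'partial';
                $partial[] = array($row, $canonical); // output 2: manual review
                break 2;
            }
        }
    }
    if ($bucket === 'none') {
        $nonMatches[] = $row;                      // output 3: manual review
    }
}

print_r($exact);      // 'The Lord of the Rings' matches an alias exactly
print_r($partial);    // 'Lord of teh Rings' is 2 edits from an alias
print_r($nonMatches); // 'Rings of our Lord' matches nothing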

 


What you should do is create a script to:

1. read the data from the db into memory (since it is not that huge a number of records)

2. calculate the similarity between entries (that is called minimal edit distance; you can find algorithms at http://en.wikipedia.org/wiki/Edit_distance) - better to use some heuristics when calculating this (for example, only compute it for strings that begin with the same letter; if they start with different letters you can just say the edit distance is some big number like 99) - this can take considerable time

3. dump the results into a text file with the edit distance as the first column

4. sort that file using the sort command if you are on some flavour of unix

5. look from the beginning of the file, since it will be sorted by increasing edit distance
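
Since PHP has a built-in levenshtein() function, a sketch of steps 1-3 might look like this (table/column names are made up, and I've added a distance threshold as an extra assumption, because 35,000 records make roughly 600 million pairs -- far too many to dump in full):

<?php
// Sketch of steps 1-3. "titles", "id" and "name" are placeholders.
// Note: older PHP versions cap levenshtein() input at 255 characters,
// which is fine for titles.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$rows = $pdo->query('SELECT id, name FROM titles')->fetchAll(PDO::FETCH_ASSOC);

$out = fopen('distances.txt', 'w');

$n = count($rows);
for ($i = 0; $i < $n; $i++) {
    for ($j = $i + 1; $j < $n; $j++) {
        $a = strtolower($rows[$i]['name']);
        $b = strtolower($rows[$j]['name']);

        // Step 2 heuristic: skip the expensive comparison when the first
        // letters differ and record a big dummy distance instead.
        $dist = (isset($a[0], $b[0]) && $a[0] === $b[0]) ? levenshtein($a, $b) : 99;

        // Step 3: distance first, zero-padded so the file sorts cleanly.
        // The threshold keeps the output file manageable.
        if ($dist <= 5) {
            fprintf($out, "%03d\t%s\t%s\n", $dist, $a, $b);
        }
    }
}
fclose($out);
// Steps 4-5 on unix: sort distances.txt | head -n 100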

 

Hope that helps.

