JonasLAX Posted May 12, 2009

Dear all, I have the following problem: I have a lot of data from different sources in the same table, and in many cases there are duplicate entries whose names differ slightly, for example "Lord of the Rings" and "The Lord of the Rings". I assume it will be almost impossible to automatically select these records as exact matches. What I would like to do instead is generate a list of suspected duplicates, then go through that list manually and decide which entries really are duplicates and which aren't. The table currently has about 35,000 records, so the algorithm needs to be reasonably smart; simple brute force won't be enough. Can anybody point me in the right direction? Thanks, Jonas
dbo Posted May 12, 2009

You have a data quality issue, my friend. Your efforts are better spent cleaning up the data than trying to accept it as it is. That said, the best you can do in your situation is to write some queries with a LIKE clause.
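For example, something along these lines. This is only a rough sketch: I'm assuming MySQL and a table called titles with columns id and title, so adjust the names to your own schema.

```php
<?php
// Rough sketch: flag pairs where one title is contained in the other,
// e.g. "Lord of the Rings" inside "The Lord of the Rings".
// Assumes MySQL and a table titles(id, title) -- adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$sql = "SELECT a.id, a.title, b.id AS dup_id, b.title AS dup_title
        FROM titles a
        JOIN titles b ON b.id > a.id
        WHERE a.title LIKE CONCAT('%', b.title, '%')
           OR b.title LIKE CONCAT('%', a.title, '%')";

foreach ($pdo->query($sql) as $row) {
    echo $row['title'] . '  <-->  ' . $row['dup_title'] . "\n";
}
```

Be aware that this is a full self-join, so the database compares every row against every other (on the order of 600 million comparisons for 35,000 rows), and it only catches cases where one title is a substring of the other.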
JonasLAX (Author) Posted May 12, 2009

Hi dbo, thanks for your reply. There is definitely a data quality issue here. The problem is that the data is aggregated over a wide range of sources which lie beyond our control. We are therefore in the unfortunate position of having to find a way to cope with the current input and clean it up ourselves. Any suggestions or best practices on how to structure the LIKE clauses?
dbo Posted May 12, 2009

It's not going to be pretty. You'll have to build some sort of ranking tool, and you'll take a big performance hit as a result. For example:

lord of the rings
the lord of the rings
rings of our lord

The third probably has nothing to do with the other two, but how do you rank that? You'll have to build some sort of scoring matrix. Perhaps a pair gets a certain number of points for having the same number of words, loses points for every word of difference, and gains points for words in the same positions (also accounting for an offset, i.e. a missing "the"). You'll have to figure out the priorities and weights yourself, and even then it's not going to be perfect.
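Just to make the idea concrete, here's a very rough sketch of that kind of word-based scoring in PHP. The weights, the stop-word list and the function name are all made up; you'd have to tune them against your own data.

```php
<?php
// Rough sketch of a word-based similarity score between two titles.
// Weights, stop words and thresholds are arbitrary placeholders.
function title_score($a, $b) {
    $stop   = array('the', 'a', 'an', 'of');
    $wordsA = array_diff(str_word_count(strtolower($a), 1), $stop);
    $wordsB = array_diff(str_word_count(strtolower($b), 1), $stop);

    // Points for sharing the same significant words, regardless of order.
    $common = count(array_intersect($wordsA, $wordsB));
    $score  = $common * 10;

    // Penalty for every word one title has that the other doesn't.
    $score -= (count($wordsA) + count($wordsB) - 2 * $common) * 5;

    return $score;
}

echo title_score('Lord of the Rings', 'The Lord of the Rings'), "\n"; // 20
echo title_score('Lord of the Rings', 'Rings of our Lord'), "\n";     // 15 -- still high, which is exactly the problem
```

Notice that this sketch ignores word order completely, which is why the third example still scores almost as high as the real duplicate. Adding points for words in matching positions, as described above, would be the next refinement.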
Ken2k7 Posted May 12, 2009

Now imagine how impressive the Google search engine is, given the speed at which it does this kind of matching.
JonasLAX (Author) Posted May 12, 2009

But I can't be the first one ever to run into this problem?! There must be a decent solution for those of us who aren't Google, no?
php_help Posted June 4, 2009

I hope you have found a solution to your data problem by now; I have a similar issue myself. What you are looking for is not PHP or web application specific. It is a database and data quality problem: you need to clean the data before using it, and a good database design will help prevent future quality issues.

Cleaning the data: you can use a data integration tool or manual cleaning, depending on how much cleaning and scrubbing is involved. Some data integration tools have fuzzy-search components or modules to handle exactly this type of issue. You will most likely need one or more lookup tables. Once your data is clean and in use in your application, you also need to ensure the data sources follow certain rules when inputting data, or use only a single point of data entry. If you are not going to import data from these sources again, you probably don't need to worry about that part.

To clean the data, I would follow something like this:

1. Make a lookup table for the data; this table contains the data in its correct, canonical form.
2. Add several columns to this table that hold aliases: alias1, alias2, and so on.
3. For many situations this de-normalized table structure is OK.
4. Compare each row from the table you need cleaned against the lookup table (using the main column or the aliases) and filter the rows into three outputs: one for exact matches, another for rows that are similar but not quite exact (say the word order is different, or words are abbreviated), and a third for non-matches. A rough sketch of this step follows below.
5. This should greatly reduce the number of rows that need cleaning. Manually clean only the two outputs that need attention: the partial matches and the non-matches.
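Here is a bare-bones PHP sketch of step 4. PHP's built-in levenshtein() does the fuzzy comparison; the table names, column names and the distance threshold are just placeholders for illustration.

```php
<?php
// Rough sketch of step 4: bucket each dirty row against a lookup table.
// Assumes lookup(canonical, alias1, alias2) and dirty(id, title);
// both layouts and the threshold of 5 are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$lookup = $pdo->query("SELECT canonical, alias1, alias2 FROM lookup")
              ->fetchAll(PDO::FETCH_ASSOC);

$exact = $partial = $none = array();

foreach ($pdo->query("SELECT id, title FROM dirty") as $row) {
    $best = PHP_INT_MAX;
    foreach ($lookup as $l) {
        foreach (array($l['canonical'], $l['alias1'], $l['alias2']) as $name) {
            if ($name === null || $name === '') continue;
            $d = levenshtein(strtolower($row['title']), strtolower($name));
            if ($d < $best) $best = $d;
        }
    }
    if ($best === 0)    $exact[]   = $row;   // identical to a known name or alias
    elseif ($best <= 5) $partial[] = $row;   // close enough to review by hand
    else                $none[]    = $row;   // no plausible match in the lookup table
}
```

For 35,000 rows against a reasonably sized lookup table this is still a lot of comparisons, so run it as a one-off batch job rather than inside a web request.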
rivan Posted June 6, 2009

What you should do is create a script that:

1. Reads the data from the database into memory (35,000 records is not that many).
2. Calculates the similarity between entries. This is called the minimal edit distance; you can find algorithms at http://en.wikipedia.org/wiki/Edit_distance. It is better to use some heuristic here, for example only computing the distance for strings that begin with the same letter, and treating pairs that start with different letters as having some big distance like 99. Even so, this step can take considerable time.
3. Dumps the results into a text file with the edit distance as the first column.
4. Sorts that file, e.g. with the sort command if you are on some flavour of Unix.
5. Reads from the beginning of the file, since the pairs will now be ordered by increasing edit distance.

Hope that will help.
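PHP happens to have levenshtein() built in, so a quick-and-dirty version of the above could look roughly like this. The table and column names and the cut-off of 10 are placeholders; adjust them to your data.

```php
<?php
// Rough sketch: load all titles, compute pairwise edit distances with the
// first-letter shortcut, and dump suspected duplicates sorted by distance.
// Table/column names and the cut-off value are placeholders.
$pdo    = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$titles = $pdo->query("SELECT title FROM items")->fetchAll(PDO::FETCH_COLUMN);

$pairs = array();
$n = count($titles);
for ($i = 0; $i < $n; $i++) {
    for ($j = $i + 1; $j < $n; $j++) {
        // Heuristic from step 2: skip pairs whose first letters differ.
        if (strtolower(substr($titles[$i], 0, 1)) !== strtolower(substr($titles[$j], 0, 1))) {
            continue;
        }
        $d = levenshtein(strtolower($titles[$i]), strtolower($titles[$j]));
        if ($d <= 10) {   // arbitrary cut-off so the output file stays manageable
            $pairs[] = sprintf("%03d\t%s\t%s", $d, $titles[$i], $titles[$j]);
        }
    }
}

sort($pairs);   // the zero-padded distance is the first column, so a plain sort orders by it
file_put_contents('suspected_duplicates.txt', implode("\n", $pairs));
```

Two caveats: PHP's levenshtein() only handles strings up to 255 characters (fine for titles), and even with the first-letter shortcut the double loop over 35,000 rows is hundreds of millions of iterations, so expect it to run for a while. Grouping the titles by first letter or first word before looping would cut that down considerably.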