Jump to content

[SOLVED] Matching similar records


roopurt18

Recommended Posts

I have the following table structure for a web form that captures user information:

 

CREATE TABLE IF NOT EXISTS `iview_response`(

  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,

  `entered` DATETIME NOT NULL,

  `exported` TINYINT(1) NOT NULL DEFAULT 0

);

 

CREATE TABLE IF NOT EXISTS `iview_vitals`(

  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,

  `response` INT NOT NULL DEFAULT 0,

  `fname` VARCHAR(24) NOT NULL DEFAULT '',

  `lname` VARCHAR(24) NOT NULL DEFAULT '',

  `addr1` VARCHAR(32) NOT NULL DEFAULT '',

  `addr2` VARCHAR(32) NOT NULL DEFAULT '',

  `city` VARCHAR(32) NOT NULL DEFAULT '',

  `state` VARCHAR(2) NOT NULL DEFAULT '',

  `zip` VARCHAR(5) NOT NULL DEFAULT '',

  `hphone` VARCHAR(17) NOT NULL DEFAULT '',

  `wphone` VARCHAR(17) NOT NULL DEFAULT '',

  `cphone` VARCHAR(17) NOT NULL DEFAULT '',

  `fax` VARCHAR(17) NOT NULL DEFAULT ''

);

 

CREATE TABLE IF NOT EXISTS `iview_emails`(

  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,

  `response` INT NOT NULL DEFAULT 0,

  `lpart` VARCHAR(128) NOT NULL DEFAULT '',

  `domain` VARCHAR(255) NOT NULL DEFAULT '',

  UNIQUE( `response`, `lpart`, `domain` )

);

 

iview_emails exists as a separate table to allow people to enter as many e-mail addresses as they'd like when filling out a form.

 

These three tables are intended to capture marketing information so users can potentially fill out the form multiple times.  I can't place any unique constraints on the fields because marketing being marketing, they don't want to reject any information.

 

So I have to build a "deduping" report that displays similar records that scales relatively well.  Anyone have any recommendations on the best way to check for similar, but not exact, records within a table?

 

 

Link to comment
Share on other sites

I wanted to pop back in and explain what I did and mark this as solved just in case someone runs into a similar issue in the future.

 

I created a page with a form that has two fields.  One field is a drop down of the possible columns in the table, such as fname, lname, addr1, etc;  The drop down displays user friendly descriptions but maps to those values behind the scenes.  The other field is a drop down with the choices "Exact" or "Sounds Like."  Each time this form is submitted it runs the query, which I'll talk about in a minute, but then redisplays itself with two new drop downs to select another field-match type combination.

 

So for example, when first presented with the form I can specify to find duplicates where the states are exact matches.  I can then further clarify that I want cities that are exact matches.  Lastly, I can add a filter where the last names or phone numbers or whatever sound similar.

 

I've tested this on data generated by the data generator in the sticky on this board and some manual entries as well; it seems to work well but it hasn't faced any real-world scenarios yet.

 

As for the query itself, I'm actually running quite a few.

 

As I stated before, I have three tables: iview_response, iview_vitals, iview_emails.  iview_response is the master record created whenever a user submits a form.  iview_vitals contains information such as first name, last name, address, and contact information.  I chose to store the emails in a separate table with fields "local_part" and "domain" so that users could enter multiple e-mail addresses per response.

 

The query suggested by Barand is fairly straight forward when the results are stored in a single table but becomes more complicated with my table setup.  So here is what I do with the database:

 

1) Create a temporary table based off of iview_vitals.

 

2) Alter the temp table and add an "email" column with type TINYTEXT, since an e-mail can be longer than 255 characters.

 

3) Do an INSERT ... SELECT into the temp table, selecting from iview_vitals and iview_emails.  In this step, the e-mail is converted back into a "local_part@domain" form.  However, if someone enters multiple e-mail addresses, the temp table will have duplicate rows where only the "email" field differs; we must make a note of this.

 

4) Add a FULLTEXT index on the email column in the temp table.

 

The query we intend to run has to include the same table twice; however, temporary tables can appear only a single time in a query, so we...

 

5) Create a second temp table and immediately select from the first temp table; now we have two identical temporary tables.

 

6) Run a form of the query suggested by Barand.

 

It's important to note that we do not want to compare the same record to itself, which would always report a duplicate.  We also only want to compare the records in a single direction, i.e. once we compare record 1 to record 2, we don't need to compare record 2 to record 1.  This is why Barand included:

WHERE a.id < b.id

in his query.

 

My temp tables contain a column "response_id" which is the master record the vitals / emails belong to.  I stated earlier we had to take note that the temp table would have duplicate rows where users entered multiple e-mail addresses.  To ensure that records are compared in a single direction and that vitals belonging to the same response weren't compared with each other, I changed my WHERE part to be:

WHERE a.response_id < b.response_id

 

I seeded my tables with 200 records, which is far less than most of our clients will have for quite some time as far as I can tell, and it runs quickly.  I'm hoping it scales.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.