Jump to content

Need to find "similar" rows in MySQL


arewhyfour

Recommended Posts

I'm pulling in feeds from multiple sites that sometimes have duplicate information between sites. I'm inserting them all into a database, here is the information I'm inserting:

 

 

(Feed 1) # Ric Flair On His Loyalty To WWE, His Greatest Opponent

(Feed 2) # Ric Flair On His Loyalty To WWE, Says Sting's Not His Greatest Opponent, His Future

 

(Feed 1) # WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More

(Feed 2) # WrestleMania Events, Austin/Hatton, WWE Teams With NHL

 

(Feed 1) # Consequences Creed's Return To TNA, Tonight's TNA House Show, More Notes

(Feed 2) # Consequences Creed's Return, TNA House Show, More Notes

 

As you can tell, the titles are quite similar, but yet very different.

 

I need some way of running a sql query that'll clean this duplicate information up.

Link to comment
https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/
Share on other sites

It'd be too much going on for that method.

 

I'd have to pull results from the table, remove stop words, split the names into an array.

 

Then I'd have to pull another listing with everything in that table and for each item I came across, split into an array and check against the item above.

 

So, for each article I pull, I'd be choking through about 3,000 results. So that's a possibility of pulling 9,000,000 rows at once.

You don't need to perform the query like that.  You'd need to write an actual algorithm that would run through the database more cleanly.  It's either using %LIKE%, creating an algorithm, or doing it manually.  I can't think off the top of my head another way of doing it.

fulltext looks favourite

 

I set up a table "feeds" containing your feed 1 items - fulltext index on item

[pre]

+---------+-------------------------------------------------------------------------+

| idfeeds | item                                                                    |

+---------+-------------------------------------------------------------------------+

|      1 | Ric Flair On His Loyalty To WWE, His Greatest Opponent                  |

|      2 | WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More  |

|      3 | Consequences Creed's Return To TNA, Tonight's TNA House Show, More notes|

+---------+-------------------------------------------------------------------------+

[pre]

 

then processed your feed 2 items plus an extra

<?php
include 'db.php';

$newfeeds = array (
                "Ric Flair On His Loyalty To WWE, Says Sting's Not His Greatest Opponent, His Future",
                "WrestleMania Events, Austin/Hatton, WWE Teams With NHL",
                "Consequences Creed's Return, TNA House Show, More Notes",
                "George Bush Resigns over Rice Affair"
            );
            
foreach ($newfeeds as $news)
{
    $news = addslashes($news);
    $sql = "SELECT COUNT(*) as matches FROM feeds
            WHERE MATCH (item) AGAINST ('$news') > 0";
    $res = mysql_query($sql) or die (mysql_error());
    if (mysql_result($res,0)==0)
    {
        mysql_query("INSERT INTO feeds (item) VALUES ('$news')");
    }
}
?>

 

new data

[pre]

+---------+--------------------------------------------------------------------------+

| idfeeds | item                                                                    |

+---------+--------------------------------------------------------------------------+

|      1 | Ric Flair On His Loyalty To WWE, His Greatest Opponent                  |

|      2 | WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More  |

|      3 | Consequences Creed's Return To TNA, Tonight's TNA House Show, More Notes |

|      4 | George Bush Resigns over Rice Affair                                    |

+---------+--------------------------------------------------------------------------+

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.