arewhyfour Posted March 23, 2008 Share Posted March 23, 2008 I'm pulling in feeds from multiple sites that sometimes have duplicate information between sites. I'm inserting them all into a database, here is the information I'm inserting: (Feed 1) # Ric Flair On His Loyalty To WWE, His Greatest Opponent (Feed 2) # Ric Flair On His Loyalty To WWE, Says Sting's Not His Greatest Opponent, His Future (Feed 1) # WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More (Feed 2) # WrestleMania Events, Austin/Hatton, WWE Teams With NHL (Feed 1) # Consequences Creed's Return To TNA, Tonight's TNA House Show, More Notes (Feed 2) # Consequences Creed's Return, TNA House Show, More Notes As you can tell, the titles are quite similar, but yet very different. I need some way of running a sql query that'll clean this duplicate information up. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/ Share on other sites More sharing options...
AdRock Posted March 23, 2008 Share Posted March 23, 2008 I don't know if full text search would help. Either that, or %LIKE% Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499004 Share on other sites More sharing options...
p2grace Posted March 23, 2008 Share Posted March 23, 2008 Another idea would be splitting each word into an array, and see if other strings have a given number of identical words. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499011 Share on other sites More sharing options...
arewhyfour Posted March 23, 2008 Author Share Posted March 23, 2008 I need to check it against a database that's currently at 3,000 records and grows at about 300 a day. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499022 Share on other sites More sharing options...
p2grace Posted March 23, 2008 Share Posted March 23, 2008 If you wrote the algorithm correctly it should be able to handle it though. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499024 Share on other sites More sharing options...
arewhyfour Posted March 23, 2008 Author Share Posted March 23, 2008 It'd be too much going on for that method. I'd have to pull results from the table, remove stop words, split the names into an array. Then I'd have to pull another listing with everything in that table and for each item I came across, split into an array and check against the item above. So, for each article I pull, I'd be choking through about 3,000 results. So that's a possibility of pulling 9,000,000 rows at once. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499032 Share on other sites More sharing options...
p2grace Posted March 23, 2008 Share Posted March 23, 2008 You don't need to perform the query like that. You'd need to write an actual algorithm that would run through the database more cleanly. It's either using %LIKE%, creating an algorithm, or doing it manually. I can't think off the top of my head another way of doing it. Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499073 Share on other sites More sharing options...
Barand Posted March 24, 2008 Share Posted March 24, 2008 fulltext looks favourite I set up a table "feeds" containing your feed 1 items - fulltext index on item [pre] +---------+-------------------------------------------------------------------------+ | idfeeds | item | +---------+-------------------------------------------------------------------------+ | 1 | Ric Flair On His Loyalty To WWE, His Greatest Opponent | | 2 | WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More | | 3 | Consequences Creed's Return To TNA, Tonight's TNA House Show, More notes| +---------+-------------------------------------------------------------------------+ [pre] then processed your feed 2 items plus an extra <?php include 'db.php'; $newfeeds = array ( "Ric Flair On His Loyalty To WWE, Says Sting's Not His Greatest Opponent, His Future", "WrestleMania Events, Austin/Hatton, WWE Teams With NHL", "Consequences Creed's Return, TNA House Show, More Notes", "George Bush Resigns over Rice Affair" ); foreach ($newfeeds as $news) { $news = addslashes($news); $sql = "SELECT COUNT(*) as matches FROM feeds WHERE MATCH (item) AGAINST ('$news') > 0"; $res = mysql_query($sql) or die (mysql_error()); if (mysql_result($res,0)==0) { mysql_query("INSERT INTO feeds (item) VALUES ('$news')"); } } ?> new data [pre] +---------+--------------------------------------------------------------------------+ | idfeeds | item | +---------+--------------------------------------------------------------------------+ | 1 | Ric Flair On His Loyalty To WWE, His Greatest Opponent | | 2 | WrestleMania Festivities, Austin/Hatton, WWE Teams With NHL Team, More | | 3 | Consequences Creed's Return To TNA, Tonight's TNA House Show, More Notes | | 4 | George Bush Resigns over Rice Affair | +---------+--------------------------------------------------------------------------+ Link to comment https://forums.phpfreaks.com/topic/97520-need-to-find-similar-rows-in-mysql/#findComment-499096 Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.