
comparing two text documents


zintani


Hello,

I would like to ask whether there is a way to remove stop words (the, and, is, are, etc.) from a text without damaging other words. For example, "stand" loses its last three letters ("and") when I use a replace function.

Also, I have managed to retrieve text documents stored in a MySQL database, and I would like to know whether there is a way to measure the similarity between those documents.


You can use regular expressions to strip out stop words by making use of the \b word-boundary assertion, which matches only at the start or end of a word:

php > $a = "Sandy and the band played at Dave and Buster's";
php > echo str_replace('and', '', $a);
Sy  the b played at Dave  Buster's
php > echo preg_replace('/\band\b/i', '', $a);
Sandy  the band played at Dave  Buster's
php > 

-Dan

$old_string = "Sandy and the band played at Dave And Buster's the other day.";

//Create an array of the words to remove
$patterns = array('and', 'the', 'at');
//Convert patterns to have word boundaries and be case insensitive
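foreach($patterns as &$val) { $val = '#\b' . preg_quote($val, '#') . '\b#i'; }
unset($val); //Break the reference so later use of $val cannot alter $patterns
//Make replacements (a single empty string is applied to every pattern)
$new_string = preg_replace($patterns, '', $old_string);
//Collapse the runs of spaces left behind by the removed words
$new_string = trim(preg_replace('#\s+#', ' ', $new_string));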

echo "$old_string<br>\n"; //Sandy and the band played at Dave And Buster's the other day.
echo "$new_string<br>\n"; //Sandy band played Dave Buster's other day.
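
As a variation on the loop above, the word list can also be compiled into one alternation pattern so the text is scanned only once. This is a minimal sketch, not something from the posts above: the function name remove_stop_words and the whitespace clean-up at the end are assumptions made for illustration.

```php
<?php
// Hypothetical helper: removes whole-word, case-insensitive matches
// of every entry in $stopWords from $text.
function remove_stop_words(string $text, array $stopWords): string
{
    if ($stopWords === []) {
        return $text; // nothing to remove
    }
    // Escape each word so regex metacharacters in the list are taken literally.
    $escaped = array_map(function ($w) { return preg_quote($w, '/'); }, $stopWords);
    // Build one pattern of the form /\b(?:and|the|at)\b/i
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    $stripped = preg_replace($pattern, '', $text);
    // Collapse the gaps left behind by the removed words.
    return trim(preg_replace('/\s+/', ' ', $stripped));
}

echo remove_stop_words(
    "Sandy and the band played at Dave And Buster's the other day.",
    array('and', 'the', 'at')
); // Sandy band played Dave Buster's other day.
```

With a long stop word list, a single pattern avoids running one regex pass per word, and "Sandy" and "band" are still left alone because of the \b boundaries.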

Thanks mjdamato,

I had been building the array by hand along the same lines:

$more = array ('/\ba\b/i','/\babout\b/i','/\babove\b/i','/\bacross\b/i','/\bafter\b/i','/\bagain\b/i',
'/\bagainst\b/i','/\ball\b/i','/\balmost\b/i','/\balone\b/i','/\balong\b/i','/\balready\b/i',
'/\balso\b/i','/\balthough\b/i','/\balways\b/i','/\bamong\b/i','/\ban\b/i','/\band\b/i','/\banother\b/i',
'/\bany\b/i','/\banybody\b/i','/\banyone\b/i','/\banything\b/i','/\banywhere\b/i','/\bare\b/i','/\barea\b/i',
'/\bareas\b/i');

That was time-consuming. I wanted to wrap each word in \b ... \b automatically, and your code does exactly that.
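
The second part of the original question, measuring how similar two documents are, is not answered in the replies. One common approach, sketched here purely as an assumption about what might suit this case, is cosine similarity over word counts, ideally computed after the stop words have been removed. The function names word_counts and cosine_similarity are hypothetical. PHP's built-in similar_text() is a character-level alternative.

```php
<?php
// Hypothetical helper: lower-cased word => count map for one document.
function word_counts(string $text): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Cosine similarity of two word-count vectors:
// 0.0 means no shared words, values near 1.0 mean the same word proportions.
function cosine_similarity(string $a, string $b): float
{
    $ca = word_counts($a);
    $cb = word_counts($b);

    // Dot product over the words the two documents share.
    $dot = 0;
    foreach ($ca as $word => $count) {
        if (isset($cb[$word])) {
            $dot += $count * $cb[$word];
        }
    }

    // Euclidean length of each count vector.
    $square = function ($n) { return $n * $n; };
    $normA = sqrt(array_sum(array_map($square, $ca)));
    $normB = sqrt(array_sum(array_map($square, $cb)));

    if ($normA == 0 || $normB == 0) {
        return 0.0; // an empty document is similar to nothing
    }
    return $dot / ($normA * $normB);
}

// Usage: compare two rows fetched from the database.
$score = cosine_similarity("Sandy band played", "band played Dave");
printf("%.3f\n", $score);
```

The score ignores word order, so it is best treated as a rough measure of shared vocabulary rather than of meaning.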

