comparing two text documents


zintani

Hello,

I would like to ask whether there is a way to remove stop words (the, and, is, are, etc.) from a text without damaging other words. For example, a plain replace function turns "stand" into "st", because its last three letters match "and".

Also, I have managed to retrieve text documents stored in a MySQL database, and I would like to know if there is a way to measure the similarity between those documents.


You can use regular expressions to strip out stop words by making use of the \b word-boundary assertion, which matches only at the start or end of a word:

php > $a = "Sandy and the band played at Dave and Buster's";
php > echo str_replace('and', '', $a);
Sy  the b played at Dave  Buster's
php > echo preg_replace('/\band\b/i', '', $a);
Sandy  the band played at Dave  Buster's
php > 

-Dan


$old_string = "Sandy and the band played at Dave And Buster's the other day.";

//Create an array of the words to remove
$patterns = array('and', 'the', 'at');
//Convert patterns to have word boundaries and be case insensitive
foreach($patterns as &$val) { $val = "#\b{$val}\b#i"; }
unset($val); //Break the reference so later use of $val cannot alter the array
//Create replacements array
$replacements = array_fill(0, count($patterns), '');
//Make replacements
$new_string = preg_replace($patterns, $replacements, $old_string);
//Collapse the extra spaces left behind by the removed words
$new_string = trim(preg_replace('#\s+#', ' ', $new_string));

echo "$old_string<br>\n"; //Sandy and the band played at Dave And Buster's the other day.
echo "$new_string<br>\n"; //Sandy band played Dave Buster's other day.
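
A variation on the same idea, offered as a sketch rather than a definitive implementation: the word list can be compiled into one alternation pattern, with preg_quote() escaping any regex metacharacters in the words, and the leftover whitespace collapsed afterward. The function name remove_stop_words is hypothetical and not from this thread.

```php
<?php
// Hypothetical helper: strips whole-word stop words from $text.
function remove_stop_words(string $text, array $stop_words): string
{
    // Escape each word so characters such as "." or "#" are matched literally
    $escaped = array_map(function ($word) {
        return preg_quote($word, '#');
    }, $stop_words);

    // Build one case-insensitive pattern, e.g. #\b(?:and|the|at)\b#i
    $pattern = '#\b(?:' . implode('|', $escaped) . ')\b#i';
    $stripped = preg_replace($pattern, '', $text);

    // Collapse the runs of whitespace left behind by the removed words
    return trim(preg_replace('#\s+#', ' ', $stripped));
}

$old_string = "Sandy and the band played at Dave And Buster's the other day.";
echo remove_stop_words($old_string, ['and', 'the', 'at']);
// Sandy band played Dave Buster's other day.
```

With a long stop word list, a single pattern means the text is scanned once instead of once per word.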


Thanks mjdamato,

I was using the same idea for the array:

$more = array('/\ba\b/i','/\babout\b/i','/\babove\b/i','/\bacross\b/i','/\bafter\b/i','/\bagain\b/i',
'/\bagainst\b/i','/\ball\b/i','/\balmost\b/i','/\balone\b/i','/\balong\b/i','/\balready\b/i',
'/\balso\b/i','/\balthough\b/i','/\balways\b/i','/\bamong\b/i','/\ban\b/i','/\band\b/i','/\banother\b/i',
'/\bany\b/i','/\banybody\b/i','/\banyone\b/i','/\banything\b/i','/\banywhere\b/i','/\bare\b/i','/\barea\b/i',
'/\bareas\b/i');

Typing each pattern by hand was time consuming, and I wanted to add the \b...\b word boundaries automatically, which is exactly what your code does.
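
On the second half of the original question (comparing the documents once the stop words are gone), one common approach is cosine similarity over word counts. The following is a minimal sketch under that assumption, not something proposed in this thread; the function names word_counts and cosine_similarity are hypothetical. The result ranges from 0 (no words in common) to roughly 1 (same word distribution).

```php
<?php
// Hypothetical helper: builds a lowercase word => count map for a text.
function word_counts(string $text): array
{
    // Format 1 makes str_word_count() return the list of words found
    return array_count_values(str_word_count(strtolower($text), 1));
}

// Hypothetical helper: cosine similarity between the word counts of two texts.
function cosine_similarity(string $a, string $b): float
{
    $counts_a = word_counts($a);
    $counts_b = word_counts($b);

    // Dot product over the words the two texts share
    $dot = 0;
    foreach ($counts_a as $word => $n) {
        if (isset($counts_b[$word])) {
            $dot += $n * $counts_b[$word];
        }
    }

    // Euclidean length of a count vector
    $norm = function (array $counts) {
        $sum = 0;
        foreach ($counts as $n) {
            $sum += $n * $n;
        }
        return sqrt($sum);
    };

    $norm_a = $norm($counts_a);
    $norm_b = $norm($counts_b);

    // An empty text has no direction, so treat it as sharing nothing
    if ($norm_a == 0 || $norm_b == 0) {
        return 0.0;
    }
    return $dot / ($norm_a * $norm_b);
}

echo cosine_similarity("Sandy band played", "band played other day");
```

PHP also ships similar_text() and levenshtein() for character-level comparison of two strings, which may be enough for short texts.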
