
[SOLVED] preg_replace to remove whole words in string


Omzy


Basically I'm trying to dynamically generate a <meta keywords> tag.

 

Let's say I got a string like this:

 

$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices."

 

And I've got a list of stopwords which I have put into a variable $stopwords:

 

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";

 

So I basically want to recreate $description with those stopwords taken out of it. How do I do this? I tried using preg_replace but it would only do partial-word matches...

$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";
$stopwordsArray=explode('|',$stopwords);

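// Format 2 returns the words keyed by their position in the string
$descriptionArray = str_word_count($description,2);
foreach($descriptionArray as $descriptionWordKey => $descriptionWord) {
	// Note: in_array() compares case-sensitively here
	if (in_array($descriptionWord,$stopwordsArray)) {
		unset($descriptionArray[$descriptionWordKey]);
	}
}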
$descriptionArray = array_unique($descriptionArray);

print_r($descriptionArray);

You might also want to drop words of length 1.

You can also use regular expressions:

 

<?php
$description = "SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";
$stopwords = "in|the|of|to|which|where|how|is|it|if|why|who";
$stopwords = explode('|', $stopwords);
$patterns = array();
foreach ($stopwords as $stopword) {
	$patterns[] = '~\b' . preg_quote($stopword, '~') . '\b~i';
}
$description = preg_replace($patterns, '', $description);
//replace whitespace(s) with a single space
$description = preg_replace('~\s+~', ' ', $description);
?>

 

But Mark Baker's method is probably faster.
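
As a variation on the regex approach, the stopwords could probably be combined into a single alternation wrapped in \b word boundaries, so the string is scanned once rather than once per stopword. This is only a minimal sketch, assuming the same pipe-delimited $stopwords string; the shortened $description is there purely for illustration.

```php
<?php
// Shortened sample text, for illustration only
$description = "The Memory Stick PRO Duo is the ideal solution for the most portable devices.";
$stopwords = "in|the|of|to|which|where|how|is|it|if|why|who";

// Quote each stopword so regex metacharacters are treated literally,
// then rejoin everything into one alternation
$quoted = array_map(function ($word) {
    return preg_quote($word, '~');
}, explode('|', $stopwords));

// \b on both sides restricts matches to whole words; the i flag ignores case
$pattern = '~\b(?:' . implode('|', $quoted) . ')\b~i';

$description = preg_replace($pattern, '', $description);
// Collapse the gaps left behind and trim the ends
$description = trim(preg_replace('~\s+~', ' ', $description));

echo $description, "\n";
```

The \b anchors are what stop "in" from being stripped out of a word like "including".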

Right, I managed to figure that out. I've now got it displaying the 10 most popular words on the page, using array_slice, array_count_values and arsort.

 

Mark also mentioned above "You might also want to drop words of length 1" - how can I do this?


$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";
$stopwordsArray=explode('|',$stopwords);

$descriptionArray = $wordfrequency = array_count_values( str_word_count( $description, 1) );
foreach($descriptionArray as $descriptionWordKey => $descriptionWord) {
	if ((in_array($descriptionWordKey,$stopwordsArray)) || (strlen($descriptionWordKey) == 1)) {
		unset($descriptionArray[$descriptionWordKey]);
	}
}
arsort($descriptionArray);

print_r($descriptionArray);

Note that the word is now the array key, and the value is the number of occurrences in the description.

Think there might be an error there, it didn't seem to work for me and I noticed that $wordfrequency is only referenced once in the code...

$wordfrequency is redundant, a variable that's populated but never used, and so it's irrelevant. It can be removed without affecting the code in any way.

 

If it isn't working for you, what errors (if any) are you getting? Or what are you expecting to see and not seeing?

The output I'm getting is:

Array ( [and] => 3 [PRO] => 3 [stick] => 3 [Memory] => 3 [devices] => 2 [high] => 2 [Duo] => 2 [pocket-size] => 1 [digital] => 1 [cameras] => 1 [portable] => 1 [most] => 1 [such] => 1 [as] => 1 [Adaptor] => 1 [used] => 1 [all] => 1 [PRO-compatible] => 1 [be] => 1 [can] => 1 [use] => 1 [for] => 1 [with] => 1 [capacities] => 1 [offers] => 1 [same] => 1 [technologies] => 1 [media] => 1 [standard-size] => 1 [half] => 1 [size] => 1 [including] => 1 [speed] => 1 [sanDisk] => 1 [The] => 1 [ideal] => 1 [MagicGate] => 1 [built-in] => 1 [data] => 1 [transfer] => 1 [solution] => 1 )

which seems to tally up when I do the counts manually.

 

The only quibble I've noted is that it's case-sensitive, so "The" is counted even though "the" is in the $stopwords list.

This can be fixed by changing

in_array($descriptionWordKey,$stopwordsArray)

to

in_array(strtolower($descriptionWordKey),$stopwordsArray)
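
Pulling the pieces of this thread together, one possible way to get the top keywords is sketched below. The function name topKeywords and its parameters are hypothetical, made up for this example. It lowercases the whole text before counting (a slightly different choice from the strtolower() fix above, so that "The" and "the" are also counted as one word), drops stopwords and one-letter words, and keeps the most frequent entries with arsort and array_slice.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only
function topKeywords($text, array $stopwords, $limit = 10)
{
    // Lowercase first so counting and stopword checks ignore case
    $words = str_word_count(strtolower($text), 1);
    $counts = array_count_values($words);

    foreach ($counts as $word => $count) {
        // Drop stopwords and single-letter words
        if (in_array($word, $stopwords, true) || strlen($word) <= 1) {
            unset($counts[$word]);
        }
    }

    // Sort by frequency, highest first, keeping the words as keys
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

$stopwordsArray = explode('|', "in|the|of|to|which|where|how|is|it|if|why|who");
$sample = "The Memory Stick PRO Duo is the ideal solution. Memory Stick PRO media.";
$top = topKeywords($sample, $stopwordsArray, 3);

// Comma-separated list that could feed a meta keywords tag
echo implode(', ', array_keys($top)), "\n";
```

Words with equal counts may come out in any order, so don't rely on the order of ties.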

 
