[SOLVED] preg_replace to remove whole words in string


Omzy

Basically, I'm trying to dynamically generate a <meta keywords> tag.

 

Let's say I got a string like this:

 

$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";

 

And I've got a list of stopwords, which I have put into a variable $stopwords:

 

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";

 

So I basically want to recreate $description with those stopwords taken out of it. How do I do this? I tried using preg_replace but it would only do partial-word matches...

You know Google gives you minus points if you have more than 5 meta keywords, right? I'd make a more advanced script that searches the string for the most relevant words and picks 5 of those :)

$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";
$stopwordsArray=explode('|',$stopwords);

// Split into words; with format 2 the keys are each word's position in the string
$descriptionArray = str_word_count($description,2);
foreach($descriptionArray as $descriptionWordKey => $descriptionWord) {
	if (in_array($descriptionWord,$stopwordsArray)) {
		unset($descriptionArray[$descriptionWordKey]);
	}
}
// Drop repeated words
$descriptionArray = array_unique($descriptionArray);

print_r($descriptionArray);

You might also want to drop words of length 1.

You can also use regular expressions:

 

<?php
$description = "SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";
$stopwords = "in|the|of|to|which|where|how|is|it|if|why|who";
$stopwords = explode('|', $stopwords);
$patterns = array();
foreach ($stopwords as $stopword) {
	// \b anchors the stopword at word boundaries; the i flag ignores case
	$patterns[] = '~\b' . preg_quote($stopword, '~') . '\b~i';
}
$description = preg_replace($patterns, '', $description);
//replace whitespace(s) with a single space
$description = preg_replace('~\s+~', ' ', $description);
?>

 

But Mark Baker's method is probably faster.
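
Since $stopwords is already pipe-delimited, the per-word loop could probably be collapsed into one alternation so that preg_replace makes a single pass. A minimal sketch, assuming plain-ASCII stopwords; the function name remove_stopwords is made up for illustration and is not from the posts above:

```php
<?php
// Hypothetical helper (not from the thread): strips whole-word stopwords in one pass.
function remove_stopwords($text, $stopwords)
{
    // Quote each stopword so any regex metacharacters are treated literally
    $quoted = array();
    foreach (explode('|', $stopwords) as $stopword) {
        $quoted[] = preg_quote($stopword, '~');
    }
    // One pattern of the form \b(?:in|the|of)\b, matched case-insensitively
    $pattern = '~\b(?:' . implode('|', $quoted) . ')\b~i';
    $text = preg_replace($pattern, '', $text);
    // Collapse the gaps left behind and trim the ends
    return trim(preg_replace('~\s+~', ' ', $text));
}

echo remove_stopwords('The size of the stick is inside', 'in|the|of|is');
// prints: size stick inside
```

One thing to watch with either regex version: \b treats a hyphen as a word boundary, so "built-in" would come out as "built-" when "in" is a stopword.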

Right, I managed to figure that out. I've now got it displaying the 10 most popular words on the page, using array_count_values, arsort and array_slice.
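
For anyone reading later, those three functions might be combined roughly like this. It is only a sketch of one way to do it, not necessarily the code used here; the function name top_words and its $limit parameter are assumptions for illustration:

```php
<?php
// Hypothetical helper: returns the $limit most frequent words as word => count.
function top_words($text, $limit = 10)
{
    // Format 1 returns a plain list of the words found in $text
    $counts = array_count_values(str_word_count($text, 1));
    // Sort by count, highest first, keeping the words as keys
    arsort($counts);
    // Keep the first $limit entries (string keys are preserved)
    return array_slice($counts, 0, $limit);
}

print_r(top_words('apple apple apple pear pear fig', 2));
```

Words with equal counts may come out in any order on PHP versions before 8.0, where sorting is not stable.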

 

Mark also mentioned above "You might also want to drop words of length 1" - how can I do this?

$description="SanDisk’s Memory Stick PRO Duo is half the size of a standard-size Memory Stick PRO media and it offers the same technologies including high speed data transfer, built-in MagicGate, and high capacities. The Memory Stick PRO Duo is the ideal solution for the most portable devices such as pocket-size digital cameras and with the use of Adaptor, it can be used in all PRO-compatible devices.";

$stopwords="in|the|of|to|which|where|how|is|it|if|why|who";
$stopwordsArray=explode('|',$stopwords);

$descriptionArray = $wordfrequency = array_count_values( str_word_count( $description, 1) );
foreach($descriptionArray as $descriptionWordKey => $descriptionWord) {
	if ((in_array($descriptionWordKey,$stopwordsArray)) || (strlen($descriptionWordKey) == 1)) {
		unset($descriptionArray[$descriptionWordKey]);
	}
}
arsort($descriptionArray);

print_r($descriptionArray);

Note that the word is now the array key, and the value is the number of occurrences in the description.

I think there might be an error there. It didn't seem to work for me, and I noticed that $wordfrequency is only referenced once in the code...

$wordfrequency is redundant, a variable that's populated but never used, and so it's irrelevant. It can be removed without affecting the code in any way.

 

If it isn't working for you, what errors (if any) are you getting? Or what are you expecting to see and not seeing?

The output I'm getting is:

Array ( [and] => 3 [PRO] => 3 [stick] => 3 [Memory] => 3 [devices] => 2 [high] => 2 [Duo] => 2 [pocket-size] => 1 [digital] => 1 [cameras] => 1 [portable] => 1 [most] => 1 [such] => 1 [as] => 1 [Adaptor] => 1 [used] => 1 [all] => 1 [PRO-compatible] => 1 [be] => 1 [can] => 1 [use] => 1 [for] => 1 [with] => 1 [capacities] => 1 [offers] => 1 [same] => 1 [technologies] => 1 [media] => 1 [standard-size] => 1 [half] => 1 [size] => 1 [including] => 1 [speed] => 1 [sanDisk] => 1 [The] => 1 [ideal] => 1 [MagicGate] => 1 [built-in] => 1 [data] => 1 [transfer] => 1 [solution] => 1 )

which seems to tally up when I do the counts manually.

 

The only quibble I've noted is that it's case-sensitive, so "The" is counted even though "the" is in the $stopwords list.

This can be fixed by changing

in_array($descriptionWordKey,$stopwordsArray)

to

in_array(strtolower($descriptionWordKey),$stopwordsArray)
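
Putting the pieces together, the counting code with that change applied might look like the sketch below. The function wrapper keyword_counts is an assumption for illustration; the logic follows the code posted above, minus the unused $wordfrequency:

```php
<?php
// Hypothetical wrapper around the thread's code, with the strtolower fix applied.
function keyword_counts($text, $stopwords)
{
    $stopwordsArray = explode('|', strtolower($stopwords));
    // word => number of occurrences
    $counts = array_count_values(str_word_count($text, 1));
    foreach ($counts as $word => $count) {
        // Lower-case only for the comparison, so "The" matches "the"
        $isStopword = in_array(strtolower($word), $stopwordsArray);
        if ($isStopword || strlen($word) == 1) {
            unset($counts[$word]);
        }
    }
    // Most frequent words first
    arsort($counts);
    return $counts;
}

print_r(keyword_counts('The cat and the hat: a cat', 'the|and'));
```

The counts themselves are still case-sensitive here, so "Memory" and "memory" would be tallied separately; lower-casing the text before counting would merge them.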

 
