counting specific words in string (reliably)

gerkintrigg · November 30, 2009

What's the easiest way of counting specific words in a string?

If I had a string like this:

hear heard hearing heard <b>hear hear hear</b>hear

and counted the words using any of the popular methods like substr_count, I'd end up with anything from 3 to 8. I want it to pick up (like a human would) ONLY the word "hear" and not "heard" or "hearing", but it should also count the word when a tag sits directly before or after it. If I explode the string on spaces, it won't pick up the words located right next to a tag.

I tried using preg_match and preg_match_all but can't work out how to count the results from the matching.

Could someone please help?

Psycho · November 30, 2009

I want it to pick up (like a human would) ONLY the word "hear" and not "heard" or "hearing"...

??? How are "heard" or "hearing" not words? If I was manually doing a word count I would count them as words. If you want code to not count different tenses of words you have a very, very long project on your hands. You will need to build an extensive dictionary of words and their tenses.

As for the html tags, I would suggest using preg_replace to remove any tags before doing the word counts (and also use it to remove any multiple spaces). Then do an explode to create an array of each word in the string. Then use array_unique() to end up only with the unique words. You will then need to create a custom function to remove different tenses of words.

oni-kun · November 30, 2009

Doesn't he mean just to match "[:space:]hear[:space:]"? That is basically a reliable way, and you'd only need to define a few simple patterns such as (space)hear(?!) etc.
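For illustration, a minimal sketch comparing a whitespace-delimited pattern with a word-boundary pattern on the sample string from the first post. The variable names are made up for this example, and the whitespace pattern is only one possible reading of the suggestion above.

```php
<?php
$sample = 'hear heard hearing heard <b>hear hear hear</b>hear';

// Requires literal whitespace on both sides of the word, so it misses
// "hear" at the start or end of the string and next to a tag.
$spaced = preg_match_all('~\shear\s~', $sample);

// \b is zero-width, so neighbouring tags or string edges do not matter.
$bounded = preg_match_all('~\bhear\b~', $sample);

echo "whitespace: $spaced, word boundary: $bounded"; // whitespace: 1, word boundary: 5
```

The whitespace version also consumes the spaces it matches, so two occurrences separated by a single space cannot both match.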

cags · November 30, 2009

I believe the OP's objective is to simply count the number of instances of a specified word. In the example given that should be 'hear', which shouldn't match 'heard' or 'hearing'. It also shouldn't match anything inside an HTML tag such as '<div class="hear">'; it just wasn't terribly well explained.

To count them you should simply need to use preg_match_all (using the word boundary solution already discussed in at least one other thread with the OP) and then to simply use count to count the number of items returned. To ignore the contents of tags, the simplest solution would be as mjdamato said, to strip the tags first (using strip_tags or a Regular Expression).
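The steps described here (remove the tags, match on word boundaries, count the results) could be sketched roughly as follows. The function name count_word is hypothetical, not something from the thread. One assumption worth flagging: the sketch replaces each tag with a space rather than calling strip_tags, because strip_tags would turn "hear</b>hear" into "hearhear" and lose two of the occurrences in the sample string.

```php
<?php
// Hypothetical helper: counts whole-word occurrences of $word in $html,
// ignoring anything inside tags.
function count_word(string $html, string $word): int
{
    // Swap every tag for a space so words on either side of a tag
    // stay separate words.
    $text = preg_replace('~<[^>]*>~', ' ', $html);

    // Escape the word so it is safe inside a ~-delimited pattern.
    $pattern = '~\b' . preg_quote($word, '~') . '\b~';

    // $matches[0] receives every full match; count() gives the total.
    preg_match_all($pattern, $text, $matches);
    return count($matches[0]);
}

echo count_word('hear heard hearing heard <b>hear hear hear</b>hear', 'hear'); // 5
```

The simple tag pattern here assumes reasonably well-formed markup; a ">" inside an attribute value would confuse it.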

salathe · November 30, 2009

… then to simply use count to count the number of items returned.

Or, just look at the return value from preg_match_all, which is the number of matches found.

cags · November 30, 2009

Well obviously if you wanted to do it the easy way...

gerkintrigg · December 1, 2009

I'm using the following code to try to count the number of times "web" occurs in the string:

$pattern='~\b'.$word.'\b(?![^<]*?>)~';
$string="websites on the web are cobwebs";
if($r['flagged']=='y'){
	$style='flagged';
	$plus = count(preg_match($pattern, strip_tags($my_page)));
	$_SESSION['flagged']=$_SESSION['flagged']+$plus;
}

I need to only count "web" and not "websites" or "cobwebs".

I know I could explode the string on spaces, but I have good reasons not to; perhaps too many to mention here, though.

salathe · December 1, 2009

Given the $string and $word values, a basic method of counting occurrences of that word is like:

$string = 'websites on the web are cobwebs';
$word   = 'web';

$word_escaped = preg_quote($word, '~');
$pattern = '~\b' . $word_escaped . '\b~';

$count = preg_match_all($pattern, $string, $matches);
echo "'$word' occurs $count time(s) in '$string'.";
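Applied to the earlier snippet, the part to change appears to be the count(preg_match(...)) call: preg_match stops at the first hit and returns 0 or 1, so it cannot produce a total. A rough sketch of the adjusted logic follows; $my_page and $flagged hold stand-in values here (the latter standing in for $_SESSION['flagged']), since the real ones come from the original poster's application.

```php
<?php
// Stand-in values, assumed for this example only.
$word    = 'web';
$my_page = '<p class="web">websites on the <b>web</b> are cobwebs</p>';
$flagged = 0; // stand-in for $_SESSION['flagged']

$pattern = '~\b' . preg_quote($word, '~') . '\b~';

// preg_match_all() returns the number of full matches, so no count() is needed.
$plus = preg_match_all($pattern, strip_tags($my_page));

$flagged += $plus;
echo $flagged; // 1
```

Because the tags are stripped first, the class="web" attribute is not counted.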

gerkintrigg · December 1, 2009

Excellent! That works fine. Thank you!

I used similar code to work out the total word count but I had to divide it by 2 for some reason:

<?php
$pattern = '~\b\b(?![^<]*?>)~';
echo preg_match_all($pattern, $page, $matches) / 2;
?>

I am curious, but if it works, that's all I'm interested in for the moment.
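On the divide-by-two: a likely explanation is that \b\b matches the empty position at every word boundary, and each word has two of them (one at its start, one at its end), so the raw count comes out at twice the number of words. A small sketch of this, with $page set to a made-up sample string; matching \w+ between the boundaries is one possible way to count words directly, not necessarily what the poster ended up using.

```php
<?php
$page = 'websites on the <b>web</b> are cobwebs'; // made-up sample

// Zero-width match at each word boundary outside a tag: two per word.
$boundaries = preg_match_all('~\b\b(?![^<]*?>)~', $page);

// Matching the word itself gives the word count without dividing.
$words = preg_match_all('~\b\w+\b(?![^<]*?>)~', $page);

echo "$boundaries boundaries, $words words"; // 12 boundaries, 6 words
```

Note that \w+ treats a word such as "don't" as two words, so the result depends on what should count as a word.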
