Counting specific words in a string (reliably)


gerkintrigg

What's the easiest way of counting specific words in a string?

 

If I had a string like this:

hear heard hearing heard <b>hear hear hear</b>hear

and counted the occurrences with any of the popular functions like substr_count, I'd end up with anything from 3 to 8. I want it to pick up (like a human would) ONLY the word "hear", not "heard" or "hearing". It also has to cope with tags sitting directly before or after a word and still count those words. If I explode the string on spaces, it misses the words located right next to a tag.

 

I tried using preg_match and preg_match_all, but I can't work out how to count the results of the matching.

 

Could someone please help?

Link to comment
Share on other sites

I want it to pick up (like a human would) ONLY the word "hear" and not "heard" or "hearing"...

 

??? How are "heard" or "hearing" not words? If I was manually doing a word count I would count them as words. If you want code to not count different tenses of words you have a very, very long project on your hands. You will need to build an extensive dictionary of words and their tenses.

 

As for the HTML tags, I would suggest using preg_replace to remove any tags before doing the word counts (and also to collapse runs of multiple spaces). Then use explode to create an array of the words in the string, and array_unique() to reduce that to the unique words. You will then need to write a custom function to remove the different tenses of words.
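A rough sketch of those steps, in case it helps. The sample string is the one from the first post. Two things here are my own assumptions rather than part of the suggestion: tags are swapped for a space instead of being removed outright, and array_count_values() is added next to array_unique() because the question is about counts.

```php
<?php
$html = 'hear heard hearing heard <b>hear hear hear</b>hear';

// Swap each tag for a space, then collapse runs of whitespace.
$text = preg_replace('~<[^>]*>~', ' ', $html);
$text = trim(preg_replace('~\s+~', ' ', $text));

// One array element per word.
$words = explode(' ', $text);

// The distinct words, re-indexed from 0.
$unique = array_values(array_unique($words));

// How often each distinct word occurs.
$counts = array_count_values($words);

print_r($unique); // hear, heard, hearing
print_r($counts); // hear => 5, heard => 2, hearing => 1
```

This still treats "heard" and "hearing" as separate words; merging tenses would need the custom function mentioned above.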

Link to comment
Share on other sites

I believe the OP's objective is simply to count the number of instances of a specified word. In the example given, that word is 'hear', which shouldn't match 'heard' or 'hearing'. It also shouldn't match anything inside an HTML tag, such as '<div class="hear">'. It just wasn't terribly well explained.

 

To count them, you should simply need preg_match_all (with the word boundary solution already discussed in at least one other thread with the OP). Its return value is the number of matches, or you can use count on the array of matches it fills in. To ignore the contents of tags, the simplest solution would be, as mjdamato said, to strip the tags first (using strip_tags or a regular expression).
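A minimal sketch of that, using the string from the first post. The function name count_whole_word is made up for illustration. One assumption worth flagging: I replace each tag with a space rather than calling strip_tags(), because strip_tags() would turn "hear</b>hear" into "hearhear" and the last two words would stop matching.

```php
<?php
// Hypothetical helper (not from the thread): counts whole-word occurrences
// of $word in $html, ignoring anything inside tags.
function count_whole_word(string $html, string $word): int
{
    // Replace each tag with a space so words on either side stay separate.
    $text = preg_replace('~<[^>]*>~', ' ', $html) ?? '';

    // Escape the word so any regex metacharacters in it are taken literally.
    $pattern = '~\b' . preg_quote($word, '~') . '\b~';

    // preg_match_all() returns the number of full matches, or false on error.
    $count = preg_match_all($pattern, $text);

    return $count === false ? 0 : $count;
}

$sample = 'hear heard hearing heard <b>hear hear hear</b>hear';
echo count_whole_word($sample, 'hear'), "\n"; // prints 5
```

With strip_tags($sample) in place of the preg_replace() line, the same pattern would find 3 matches instead of 5.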

Link to comment
Share on other sites

I'm using the following code to try to count the number of times "web" occurs in the string:

$pattern='~\b'.$word.'\b(?![^<]*?>)~';
$string="websites on the web are cobwebs";
if($r['flagged']=='y'){
	$style='flagged';
	$plus = count(preg_match($pattern, strip_tags($my_page)));
	$_SESSION['flagged']=$_SESSION['flagged']+$plus;
}

I need to only count "web" and not "websites" or "cobwebs".

 

I know I could explode the string on spaces, but I have good reasons not to, though they're too involved to go into here.

Link to comment
Share on other sites

Given the $string and $word values, a basic way of counting occurrences of that word looks like this:

 

$string = 'websites on the web are cobwebs';
$word   = 'web';

$word_escaped = preg_quote($word, '~');
$pattern = '~\b' . $word_escaped . '\b~';

$count = preg_match_all($pattern, $string, $matches);
echo "'$word' occurs $count time(s) in '$string'.";
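To show why count(preg_match(...)) in the earlier snippet can't work, here is a small comparison of the three functions. The sample sentence is made up for illustration. preg_match() stops at the first match and returns 1 or 0, so there is nothing for count() to count; calling count() on an integer raises a warning from PHP 7.2 and throws a TypeError in PHP 8.

```php
<?php
$string = 'the web, the whole web, and cobwebs on websites';
$word   = 'web';

$pattern = '~\b' . preg_quote($word, '~') . '\b~';

// Stops at the first match: 1 if found, 0 if not.
$first_only = preg_match($pattern, $string);

// Counts every whole-word match.
$whole_words = preg_match_all($pattern, $string);

// Counts every substring, including "cobwebs" and "websites".
$substrings = substr_count($string, $word);

echo "preg_match: $first_only\n";      // preg_match: 1
echo "preg_match_all: $whole_words\n"; // preg_match_all: 2
echo "substr_count: $substrings\n";    // substr_count: 4
```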

Link to comment
Share on other sites

Excellent! That works fine. Thank you!

 

I used similar code to work out the total word count but I had to divide it by 2 for some reason:

<?php
$pattern = '~\b\b(?![^<]*?>)~';
echo preg_match_all($pattern, $page, $matches) / 2;
?>

I am curious, but if it works, that's all I'm interested in for the moment.
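On the "divide by 2" question, the likely explanation is that \b\b is the same as a single \b: it matches the empty position at each word boundary, and every word has two boundaries (one at its start, one at its end). A sketch comparing that with a pattern that matches the words themselves, which needs no division. The $page value below is made up for illustration.

```php
<?php
$page = 'websites on the <b class="web">web</b> are cobwebs';

// Zero-width matches: one per word boundary outside tags.
$boundaries = preg_match_all('~\b(?![^<]*?>)~', $page);

// One match per run of word characters outside tags.
$words = preg_match_all('~\b\w+\b(?![^<]*?>)~', $page);

echo "boundaries: $boundaries\n"; // boundaries: 12
echo "words: $words\n";           // words: 6
```

Both versions treat an apostrophe as a break, so "don't" counts as two words either way.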
