richardjh Posted November 14, 2011

I've been trying to come up with a way to accurately count the words in a string that includes punctuation marks. I've got close, but the white space is causing a problem. I have used a couple of functions such as str_replace and explode, and I can now get an accurate count for texts with most punctuation and 'normal' white space. BUT... if I put three extra spaces between words, the count goes up by 1.

    $words2 = str_replace("-", "", $string); // strips a hyphen
    $words3 = str_replace('  ', ' ', $words2); // replaces a double space with a single space
    $words = explode(' ', $words3);

This is obviously NOT a script, just a string of text being passed through three functions to cleanse it. But as you can see, it collapses two consecutive spaces when they occur, but no more. So runs of three, four, five spaces etc. will be counted as extra words. What else could I try to give me an accurate count? I am very raw at PHP. :'(

Thanks,
Richard
KevinM1 Posted November 14, 2011

Look into using trim.
richardjh Posted November 14, 2011

Yeah, I've used that as well, and combinations of all those functions, but so far I've not clinched it.
Psycho Posted November 14, 2011

Well, it all depends on what you consider a word. Looking at what you have, it appears that you want any group of characters separated by any number of spaces or hyphens to be considered a word. You could use string functions or possibly regular expressions. But, as you saw, a single str_replace call can't (directly) account for an arbitrary number of consecutive spaces. So, one solution would be to use a loop that keeps running as long as the string still contains two consecutive spaces.

    function wordCount($string)
    {
        // Replace hyphens with spaces
        $string = str_replace('-', ' ', $string);
        // As long as there are double spaces, replace them with single spaces
        while (strpos($string, '  ') !== false)
        {
            $string = str_replace('  ', ' ', $string);
        }
        return count(explode(' ', trim($string)));
    }

For some reason I think there has to be a built-in function that would be more appropriate, but I can't think of it at the moment. Also, what about other "non-printable" characters? Or, what if there are punctuation characters by themselves? If you want only alphanumeric characters to be used as possible words, then a regex solution is probably better.

    function wordCount($string)
    {
        return preg_match_all("#\b[\w]+\b#", $string, $matches);
    }

In the above, the characters a-z, A-Z, 0-9 and the underscore can make up words.
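The built-in function hinted at above may well be str_word_count, which is part of PHP's standard string functions. The sketch below compares it with the regex on a made-up sample string (the $text value is an assumption for illustration, not from the thread). The two disagree because str_word_count treats an apostrophe or hyphen inside a word as part of that word and ignores digits, while \w stops at both characters.

```php
<?php
// Hypothetical sample input: irregular spacing, a hyphen and an apostrophe.
$text = "  The quick   brown-fox,  isn't it?  ";

// Built-in counter: letters form words; ' and - are allowed inside a word.
// Words found: The, quick, brown-fox, isn't, it
$builtin = str_word_count($text);

// Regex from the post above: every run of [A-Za-z0-9_] is a word.
// Words found: The, quick, brown, fox, isn, t, it
$regex = preg_match_all('/\b\w+\b/', $text, $matches);

echo "str_word_count: $builtin\n"; // 5
echo "preg_match_all: $regex\n";   // 7
```

Which number is "right" depends on the definition of a word you settle on, which is the point made above.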
richardjh Posted November 14, 2011

Thank you for the quick replies and help. I found that these three lines seem to be doing what I want:

    $word1 = preg_replace('/\s+/', ' ', $text);
    $word = explode(' ', $word1);
    $words = count($word);

Using this I can put any amount of white space between words and the count remains the same (which is what I want). I will test it a bit more though before getting my hopes up.
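One case worth testing with the three lines above is text that starts or ends with white space: explode(' ', ' one two ') returns empty strings at both ends, so the count comes out as 4 rather than 2. A possible way around it, sketched here with a hypothetical helper name (countTokens is not from the thread), is to split on white space directly and ask preg_split to discard empty pieces.

```php
<?php
// Sketch only: counts whitespace-separated tokens.
// countTokens is a hypothetical helper name used for illustration.
function countTokens(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that leading or
    // trailing white space would otherwise produce.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($tokens);
}

echo countTokens("  one   two\tthree\n"), "\n"; // 3
echo countTokens(""), "\n";                      // 0
```

Calling trim() on the text before the explode would also remove the stray empty elements at the ends, although an empty string would still be counted as one word that way.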
The Little Guy Posted November 14, 2011

Try this. It converts 2+ white-space characters into 1 space. It will then split the words into an array; any empty item in the array is where a punctuation mark was.

    <?php
    header("content-type: text/plain");
    $str = "this is my string. It is awesome!";
    // Collapse runs of two or more white-space characters into one space
    $str = preg_replace("/\s\s+/", " ", $str);
    // Split on a space, an exclamation mark or a full stop
    $arr = preg_split("/ |!|\./", $str);
    print_r($arr);
    echo $str;
    ?>
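If the empty array items are not wanted, one variation (an assumption about what is needed, not something posted in the thread) is to treat white space and punctuation as a single delimiter class and let preg_split skip the empty pieces. The list of punctuation characters in the pattern is only an example and would need extending for real text.

```php
<?php
// Sketch only: splitWords is a hypothetical helper name.
// Splits on any run of white space and a small, example set of punctuation.
function splitWords(string $str): array
{
    return preg_split('/[\s.!?,;:]+/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$words = splitWords("this is my string. It is awesome!");
print_r($words);           // this, is, my, string, It, is, awesome
echo count($words), "\n";  // 7
```

Because the delimiter is a run of one or more characters, there is no need to collapse the spaces first.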
Psycho Posted November 14, 2011

If you are going to use a regular expression, I already gave you a single-line solution that works:

    function wordCount($string)
    {
        return preg_match_all("#\b[\w]+\b#", $string, $matches);
    }

Or, if you don't want to use a function:

    $words = preg_match_all("#\b[\w]+\b#", $string, $matches);

As I stated above, this counts anything made up of a-z, A-Z, 0-9 and the underscore as a possible word. If you want to expand the list of characters (or instead use a blacklist), that is easy to modify as well.
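As a rough illustration of expanding the character list, the sketch below lets an apostrophe or hyphen join two word parts and uses Unicode properties so accented letters count too. The pattern and the helper name wordCountExtended are assumptions for illustration, not something from the thread; the u modifier requires the input to be valid UTF-8.

```php
<?php
// Sketch only: wordCountExtended is a hypothetical helper name.
// A word is a run of letters, digits or underscores, optionally followed
// by more such runs joined with a single apostrophe or hyphen.
function wordCountExtended(string $string): int
{
    $pattern = "/[\p{L}\p{N}_]+(?:['-][\p{L}\p{N}_]+)*/u";
    // preg_match_all returns false on failure (e.g. invalid UTF-8); report 0 then.
    $count = preg_match_all($pattern, $string, $matches);
    return $count === false ? 0 : $count;
}

echo wordCountExtended("isn't well-known café 42"), "\n"; // 4
```

With the original \w pattern the same text would give 6, since "isn't" and "well-known" are each split in two.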