
find most frequent word


imperium2335


Here's how you could use the code ignace provided:

 

<?php
$var_str = 'Words appear here words words here again. Some more words to fill up the list.';
$var = explode(' ', $var_str);
$words = array();
foreach ($var as $word) {
if (!array_key_exists($word, $words)) {
	$words[$word] = 1;
} else {
	++$words[$word];
}
}
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
echo "$word ($count)<br />";
}
?>

 

Note that the code is case-sensitive.

 

But doing a simple explode() on the spaces doesn't take punctuation marks into account. Consider this example, using regular expressions:

 

<?php
$var_str = 'Words appear here: Words, words, here again. Some more words to fill up the list!';
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
echo "$word ($count)<br />";
}
?>


Thanks for your help! It is getting closer to what I wanted now.

 

The purpose is to determine the subject of a page based on which word occurs most often in its code. I've implemented strip_tags(), but now I would like to exclude common words like "the", "if", "is", etc.

 

Could you advise me on how to do this? Much appreciated!  :)


Hi, I have the following, but it's returning a blank page:

 


$commons = array("alot of words!!") ;

$stopwordcount = count($commons) ;

$site = Hunt($target, $useragent) ; //puts the site content into a variable string.

$headchop = explode("</head>", $site) ; //Split header and body.

$wordz = strip_tags($headchop[1]) ;

$wcount = str_word_count($words) ;

for($i = 0; $i < $wcount; $i++) {
for($x = 0; $x < $stopwordcount; $x++) {
	$words = ereg_replace("$commons[$x]", "", $wordz) ;
}
}


Sorry, later on it does:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

But now I am trying:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words= array_keys($words) ;

$words = array_diff($words, $commons) ;

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

No success yet, though :(

PS: How do I get it to filter out any array item that is a number rather than a word? Thanks for your help so far!
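
One way to approach both questions (dropping stop words and dropping numbers) is to filter while the words are still array values, before array_count_values() turns them into keys. The following is only a sketch: the sample text and the contents of $commons are made up for illustration, and is_numeric() is just one reasonable test for "is a number".

```php
<?php
// Hypothetical sample data; $commons stands in for the real stop-word list.
$text = 'The 2 cats and the 3 dogs: cats, dogs, cats in 2010.';
$commons = array('the', 'and', 'in');

// Grab the words and lowercase them in one go.
preg_match_all('~\b\w+\b~', strtolower($text), $matches);
$words = $matches[0];

// Drop stop words and purely numeric items BEFORE counting,
// while the words are still the array values.
$words = array_diff($words, $commons);
$words = array_filter($words, function ($w) {
    return !is_numeric($w);
});

// Count and sort; the words become the keys, the counts the values.
$counts = array_count_values($words);
arsort($counts);

// Output top 5
foreach (array_slice($counts, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
```

Calling array_keys() before the output loop is what loses the counts in the attempt above; keeping the word => count array intact until the end avoids that.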


Better solution using str_word_count() (my new favorite function for the day):

 

<?php
$str = 'Words appear here: Words, words, here again. Some more words to fill up the list! The sentences, they\'re good and long-winded.';
//grab words
$words = str_word_count($str, 1);
//transform all words to lowercase
$words = array_map('strtolower', $words);
//remove common words
$commons = array('the'); //etc.
$words = array_diff($words, $commons);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
echo "$word ($count)<br />";
}
?>


Couple things with your code, thebadbad:

 

Would be more efficient to strtolower the original string instead of using array_map.  Also, I don't really see the point in using str_word_count when you're already using array_count_values.  Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

 

My take:

 

$page = file_get_contents($url); // get the page contents ($url is assumed to be set earlier)
$page = preg_replace('~<script[^>]*>.*?</script>~is','',$page); // remove scripts first, while the tags still exist
$page = strip_tags($page); // remove the remaining tags
$page = strtolower($page); // make case-insensitive
preg_match_all('~[a-z]+~',$page,$words); // get array of words
$words = array_diff($words[0],$commonWords); // filter out common words ($commonWords is assumed to be defined)
$words = array_count_values($words); // count occurrences
arsort($words); // sort highest to lowest
$words = array_slice($words,0,5); // get top 5

 

The only thing still lacking is validating the words as real words.  For instance, if the page has &nbsp;, guess what: that's going to be counted as a word.  I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate).  Better solution would be to compare them against a list of real words.

 


"Would be more efficient to strtolower the original string instead of using array_map."

 

True.

 

"Also, I don't really see the point in using str_word_count when you're already using array_count_values."

 

What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).

 

"Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration."

 

Nope, I'm afraid that's false. The array expression is only run once.

 

"The only thing still lacking is validating the words as real words.  For instance, if the page has &nbsp;, guess what: that's going to be counted as a word.  I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate).  Better solution would be to compare them against a list of real words."

 

You could get rid of any HTML entities with e.g. preg_replace('~&\S+;~', '', $str), but I agree that we would probably still end up with a lot of 'false' words.
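
One caveat with that pattern, sketched here with a made-up input string: \S+ is greedy, so in a run without whitespace it can match from the first "&" to the last ";" and take a real word with it. A tighter pattern (my own suggestion, not something from the manual) only accepts a named or numeric reference between "&" and ";".

```php
<?php
// Hypothetical input with named and numeric entities around real words.
$str = 'fish&nbsp;and&nbsp;chips &amp; peas &#169; 2010';

// Greedy \S+ runs from the first "&" to the last ";" in a
// whitespace-free run, so the word "and" is swallowed here.
$greedy = preg_replace('~&\S+;~', '', $str);

// Tighter pattern: "&", then a name or a numeric reference, then ";".
// Replacing with a space keeps neighbouring words from being glued together.
$tight = preg_replace('~&(?:[a-z][a-z0-9]*|#\d+|#x[0-9a-f]+);~i', ' ', $str);

echo $greedy, "\n", $tight, "\n";
```

Running html_entity_decode() on the string before extracting words would be another option, depending on whether the decoded characters should count.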


I have tried it and swapped lots of things around, but no luck:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words = array_diff($words, $commons) ;

$words = array_keys($words) ;

$nounsused = array_intersect($nouns, $words) ;

$nounsused = array_count_values($nounsused) ;

arsort($nounsused) ;

$minA = rand(0, 3) ;
$minB = rand(0, 3) ;

//output top 5
foreach (array_slice($words, $minA, 3) as $word => $counta) {
//  echo "$word ($count)<br />";
}

foreach (array_slice($nounsused, $minB, 3) as $nounused => $countb) {
//  echo "$nounused ($count)<br />";
}

$Most_Prom = rand(0, 2) ; $Most_Prom_Noun = rand(0, 2) ;

echo "says: " . $initfrags[array_rand($initfrags)] . " " . $word . " " . $midfrags[array_rand($midfrags)] . " " . $nounused . " " . $endfrags[array_rand($endfrags)] ;

I get this output:

 

says: I think that 2 is very good, but process which is great.

 

The 2 shouldn't be there; it should be one of the 3 most popular words instead.
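
The stray 2 looks like an array index rather than a word. After array_keys() the words are the values of a 0-based list, array_slice() renumbers integer keys by default, and in "foreach (... as $word => $counta)" the variable $word receives the key, so it is left holding 2 after the third iteration. Note also that array_diff($words, $commons), when run on the word => count array, compares the counts to the stop words, so nothing gets filtered. A minimal sketch of one way around both issues, with made-up stand-in data:

```php
<?php
// Hypothetical stand-ins for the real word counts and stop-word list.
$counts = array('process' => 9, 'the' => 7, 'widget' => 5, 'is' => 4, 'gadget' => 3);
$commons = array('the', 'is');

// Filter on the KEYS (the words); array_diff() would compare the counts.
$counts = array_diff_key($counts, array_flip($commons));

// After array_keys() the words are the VALUES and the keys are 0, 1, 2...
$words = array_keys($counts);

// Pick one of the three most frequent words at random.
$top = array_slice($words, 0, 3);
$word = $top[array_rand($top)];

echo $word;
```

The same pattern should apply to the $nounsused list.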


"What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing)."

str_word_count returns the same thing as the [a-z]+ regex.  As to whether it's more efficient than the regex...not sure about that.  It's a pretty simple regex. I can't really imagine how str_word_count's internal regex could get any simpler.  I don't really feel like benchmarking it, but I'll buy into str_word_count being 'easier' for people to understand.  As far as not using it because of array_count_values: my mistake. I read your code wrong.

 

"Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration."

 

"Nope, I'm afraid that's false. The array expression is only run once."

 

Hmm... are you sure about that? I don't see anything in the manual or user notes about it, but I could have sworn I saw a thread here a while back that debated this, with benchmarks and stuff.

 

 


From the manual:

 

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

 

So it's a bit more complex than the [a-z]+ regex.
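
A quick side-by-side, using part of the earlier sample sentence (already lowercased), illustrates the difference. This is only a sketch for the default locale:

```php
<?php
$str = "the sentences, they're good and long-winded.";

// Built-in: apostrophes and hyphens inside a word are kept.
$builtin = str_word_count($str, 1);

// Plain letter runs: the same words are split at "'" and "-".
preg_match_all('~[a-z]+~', $str, $m);
$regex = $m[0];

print_r($builtin);
print_r($regex);
```

With the regex, "they're" ends up as "they" and "re", and "long-winded" as "long" and "winded", which would skew the frequency count.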

 

Regarding the foreach question, I tested it with this:

 

<?php
function test($array) {
sleep(1);
return $array;
}
$array = range(0,9);
foreach (test($array) as $value) {
echo $value;
}
?>

 

Execution took ~ 1 second. Maybe you're thinking about a for loop, where the second expression is evaluated at the beginning of each iteration.
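
The difference can also be shown without timing anything, by counting calls. In this sketch, counted() is a made-up helper that returns the array unchanged and increments a counter each time it is evaluated:

```php
<?php
// Hypothetical helper: returns the array as-is and records each call.
function counted(array $array, &$calls) {
    $calls++;
    return $array;
}

$array = range(0, 9);

// foreach: the array expression is evaluated once, before iterating.
$foreachCalls = 0;
foreach (counted($array, $foreachCalls) as $value) {
    // loop body intentionally empty
}

// for: the condition is re-evaluated before every iteration.
$forCalls = 0;
for ($i = 0; $i < count(counted($array, $forCalls)); $i++) {
    // loop body intentionally empty
}

echo "foreach: $foreachCalls, for: $forCalls";
```

For the ten-element array this reports one call for foreach and eleven for the for loop (ten passing checks plus the final failing one).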
