find most frequent word

imperium2335 · July 11, 2009

Hi,

I have downloaded a webpage using the file_get_contents function and stuffed it into an array using explode(" ", $var).

How do I now make it list say, the top 5 most common words?

Cheers.

ignace · July 11, 2009

$words = array();
foreach ($var as $word) {
    if (!array_key_exists($word, $words)) {
        $words[$word]= 1;
    } else {
        ++$words[$word];
    }
}

imperium2335 · July 11, 2009

if i echo $words[0] etc there is nothing there :S What is the output from this?

ignace · July 11, 2009

if i echo $words[0] etc there is nothing there :S What is the output from this?

That is because $words[0] doesn't exist. do print_r($words);
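To illustrate why $words[0] is empty, here is a minimal sketch of the same loop run on a made-up sentence (the sample text is an assumption, not the original page). The array is keyed by the words themselves, not by numbers:

```php
<?php
// Made-up sample input standing in for the exploded page content.
$var = explode(' ', 'red blue red green red blue');

$words = array();
foreach ($var as $word) {
    if (!array_key_exists($word, $words)) {
        $words[$word] = 1;   // first time this word is seen
    } else {
        ++$words[$word];     // seen before, bump the count
    }
}

// The keys are words, so numeric index 0 does not exist.
var_dump(isset($words[0]));
// Look a count up by the word instead.
echo $words['red'], "\n";
print_r($words);
```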

imperium2335 · July 11, 2009

thanks so far, but what I want is for it to count how many times a word occurs in the array, and then to output the top 5 words.

So far it's giving me a page of gobbledygook. One of the outputs is [4917] => information, but "information" doesn't occur 4917 times :S.

.josh · July 11, 2009

array_count_values
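A minimal sketch of what that function does, using a made-up sentence as input. It returns an array of word => count, which can then be sorted with arsort():

```php
<?php
// Made-up sample input.
$var = explode(' ', 'red blue red green red blue');

// Build word => number of occurrences in one call.
$counts = array_count_values($var);

// Sort by count, highest first, keeping the word keys.
arsort($counts);

print_r($counts);
```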

thebadbad · July 11, 2009

Here's how you could use the code ignace provided:

<?php
$var_str = 'Words appear here words words here again. Some more words to fill up the list.';
$var = explode(' ', $var_str);
$words = array();
foreach ($var as $word) {
    if (!array_key_exists($word, $words)) {
        $words[$word] = 1;
    } else {
        ++$words[$word];
    }
}
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>

Note that the code is case-sensitive.

But doing a simple explode() on the spaces doesn't take punctuation marks into account. Consider this example, using regular expressions:

<?php
$var_str = 'Words appear here: Words, words, here again. Some more words to fill up the list!';
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>

imperium2335 · July 11, 2009

Thanks for your help! It is getting closer to what i wanted now.

The purpose is to determine the subject of a page based on which word occurs most often in the code. I've implemented strip_tags, but now I would like to exclude common words like "the", "if", "is", etc.

Could you advise me on how to do this? much appreciated!

.josh · July 11, 2009

make an array of common words you want filtered out, use array_keys to grab the words from array_count_values, and use array_diff
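One caveat with that route: array_keys() returns only the words, so the counts are gone after the array_diff(). A variation that keeps the counts, offered as a sketch rather than the method anyone in the thread used, is to remove the stop words by key with array_diff_key(). The short $commons list and the token list here are made up for illustration:

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
$commons = array('the', 'is', 'if');

// Made-up tokens standing in for the words extracted from a page.
$tokens = array('the', 'cat', 'is', 'the', 'cat', 'if', 'dog');

// word => count
$counts = array_count_values($tokens);

// array_flip() turns the stop words into keys, so array_diff_key()
// can drop every entry whose key is a stop word while keeping counts.
$counts = array_diff_key($counts, array_flip($commons));

arsort($counts);
print_r($counts);
```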

thebadbad · July 11, 2009

Found this list of the 500 most common English words: http://www.world-english.org/english500.htm . If you want I can quickly put them in an array for you (nobody wants to manually enter all that).

imperium2335 · July 11, 2009

Hi, I have the following but it's returning a blank page:


$commons = array("alot of words!!") ;

$stopwordcount = count($commons) ;

$site = Hunt($target, $useragent) ; //puts the site content into a variable string.

$headchop = explode("</head>", $site) ; //Split header and body.

$wordz = strip_tags($headchop[1]) ;

$wcount = str_word_count($words) ;

for($i = 0; $i < $wcount; $i++) {
    for($x = 0; $x < $stopwordcount; $x++) {
        $words = ereg_replace("$commons[$x]", "", $wordz) ;
    }
}

thebadbad · July 11, 2009

LOL, didn't know about the str_word_count() function. Clever (if you use it right).

Your script isn't told to output anything = a blank page. Also, you should use the method outlined by CV. Would be much more efficient.

imperium2335 · July 11, 2009

Sorry, later on it does:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

But now I am trying:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words= array_keys($words) ;

$words = array_diff($words, $commons) ;

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

No success though yet

PS how do i get it to filter out any array item that is a number and not a string? thanks for your help so far!

.josh · July 11, 2009

you should be using $keys as first argument in array_diff. And I assume that $commons is a real array of common words instead of

$commons = array("alot of words!!") ;

right?
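On the side question about dropping items that are numbers: one possible approach, sketched here with made-up tokens, is to filter the word list with ctype_alpha() before counting. This assumes the ctype extension is available (it is enabled by default) and that tokens containing any digit should be discarded:

```php
<?php
// Made-up tokens; '2009' and '42' are the kind of items to drop.
$tokens = array('php', '2009', 'word', '42', 'top5');

// ctype_alpha() is true only for strings made entirely of letters,
// so purely numeric tokens and mixed ones like 'top5' are removed.
// array_values() renumbers the result from 0.
$lettersOnly = array_values(array_filter($tokens, 'ctype_alpha'));

print_r($lettersOnly);
```

Doing this before array_count_values() avoids a PHP quirk: numeric strings used as array keys are converted to integers.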

imperium2335 · July 11, 2009

Hi, yea sorry I have changed it around. Yes it contains about 500 words, i thought best not to spam the forum with all that.

thebadbad · July 11, 2009

Better solution using str_word_count() (my new favorite function for the day):

<?php
$str = 'Words appear here: Words, words, here again. Some more words to fill up the list! The sentences, they\'re good and long-winded.';
//grab words
$words = str_word_count($str, 1);
//transform all words to lowercase
$words = array_map('strtolower', $words);
//remove common words
$commons = array('the'); //etc.
$words = array_diff($words, $commons);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>

.josh · July 11, 2009

Couple things with your code, thebadbad:

Would be more efficient to strtolower the original string instead of using array_map. Also, I don't really see the point in using str_word_count when you're already using array_count_values. Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

My take:

$page = file_get_contents($url); // get the page contents
$page = preg_replace('~<script[^>]*>.*?</script>~si','',$page); // remove scripts first; strip_tags alone would leave the script bodies in the text
$page = strip_tags($page); // remove remaining tags
$page = strtolower($page); // make case-insensitive
preg_match_all('~[a-z]+~',$page,$words); // get array of words
$words = array_diff($words[0],$commonWords); // filter out common words
$words = array_count_values($words); // count occurrences
arsort($words); // sort highest to lowest
$words = array_slice($words,0,5); // get top 5

The only thing still lacking is validating the words as real words. For instance, if the page has an HTML entity such as &nbsp;, guess what, that's going to be counted as a word. I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate). A better solution would be to compare them against a list of real words.
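The "compare against a list of real words" idea could look roughly like this. The $dictionary array and the counts are hypothetical; in practice the dictionary would be loaded from a word-list file:

```php
<?php
// Hypothetical dictionary of accepted words.
$dictionary = array('word', 'list', 'page', 'frequency');

// Made-up counts, including two tokens that are not real words.
$counts = array('word' => 4, 'nbsp' => 3, 'page' => 2, 'xyzzy' => 1);

// Keep only the entries whose key appears in the dictionary.
// array_flip() turns the dictionary words into keys for the comparison.
$valid = array_intersect_key($counts, array_flip($dictionary));

print_r($valid);
```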

thebadbad · July 11, 2009

Would be more efficient to strtolower the original string instead of using array_map.

True.

Also, I don't really see the point in using str_word_count when you're already using array_count_values.

What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).

Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

Nope, I'm afraid that's false. The array expression is only run once.

The only thing still lacking is validating the words as real words. For instance, if the page has an HTML entity such as &nbsp;, guess what, that's going to be counted as a word. I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate). A better solution would be to compare them against a list of real words.

You could get rid of any HTML entities with e.g. preg_replace('~&\S+;~', '', $str), but I agree that we would probably still end up with a lot of 'false' words.
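One thing to watch with that pattern: \S+ is greedy, so in a run of text without spaces it can swallow real words sitting between two entities. A sketch with a tighter pattern follows; the stricter regex and the sample string are assumptions for illustration, not something from the thread:

```php
<?php
// Made-up sample text containing HTML entities.
$str = 'Tom&amp;Jerry&nbsp;rock &#169; 2009';

// Pattern from the post above: \S+ runs to the last ';' in the
// space-free run, so "Jerry" is removed along with the entities.
$greedy = preg_replace('~&\S+;~', ' ', $str);

// Tighter alternative: match one named, decimal, or hex entity at a time.
$strict = preg_replace('~&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);~i', ' ', $str);

// Replacing with a space keeps neighbouring words from being fused.
print_r(str_word_count($greedy, 1));
print_r(str_word_count($strict, 1));
```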

imperium2335 · July 11, 2009

Is there any way to collate the keys together? Like I have the 3 most frequent returned, which is great, but how can I make it so $word[0] is the most frequent, [1] is the second most, etc.?

thebadbad · July 11, 2009

After sorting array_count_values($words) with arsort(), you can simply do array_keys($words)
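A short sketch of that step, with made-up counts:

```php
<?php
// Made-up word => count array, as array_count_values() would return.
$counts = array('here' => 2, 'words' => 4, 'list' => 1);

// Highest count first, keys preserved.
arsort($counts);

// array_keys() gives a plain list: index 0 is the most frequent word.
$ranked = array_keys($counts);

echo $ranked[0], "\n";
echo $ranked[1], "\n";
```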

imperium2335 · July 11, 2009

Have tried it and swapped lots of things around but no luck:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words = array_diff($words, $commons) ;

$words = array_keys($words) ;

$nounsused = array_intersect($nouns, $words) ;

$nounsused = array_count_values($nounsused) ;

arsort($nounsused) ;

$minA = rand(0, 3) ;
$minB = rand(0, 3) ;

//output top 5
foreach (array_slice($words, $minA, 3) as $word => $counta) {
//  echo "$word ($count)<br />";
}

foreach (array_slice($nounsused, $minB, 3) as $nounused => $countb) {
//  echo "$nounused ($count)<br />";
}

$Most_Prom = rand(0, 2) ; $Most_Prom_Noun = rand(0, 2) ;

echo "says: " . $initfrags[array_rand($initfrags)] . " " . $word . " " . $midfrags[array_rand($midfrags)] . " " . $nounused . " " . $endfrags[array_rand($endfrags)] ;

I get this output:

says: I think that 2 is very good, but process which is great.

The 2 shouldn't be there, it should be one of the 3 most popular words instead.
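A likely cause of the stray 2, assuming $words is the plain list produced by array_keys(): in that list the words are the values, and array_slice() renumbers integer keys from 0. So in `foreach (... as $word => $counta)` the variable $word receives the index, and after a three-item loop it is left holding 2. A minimal sketch with made-up data:

```php
<?php
// Stand-in for the array_keys() result: words ranked by frequency.
$words = array('widget', 'process', 'gadget', 'thing');

foreach (array_slice($words, 1, 3) as $index => $value) {
    // array_slice() renumbers integer keys, so $index runs 0, 1, 2
    // and $value holds the actual word.
}
echo $index, "\n"; // the last key, which is what ended up in the sentence

// One way to pick a random word from the top three instead:
$top = array_slice($words, 0, 3);
$pick = $top[array_rand($top)];
echo $pick, "\n";
```

Note also that calling array_diff($words, $commons) before array_keys() compares the counts, not the words, against the stop list, so it would not remove any common words.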

.josh · July 11, 2009

What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).

str_word_count returns the same thing as the [a-z]+ regex. As to whether it's more efficient than the regex...not sure about that. It's a pretty simple regex. I can't really imagine how str_word_count's internal regex could get any simpler. I don't really feel like benchmarking it, but I'll buy into str_word_count being 'easier' for people to understand. As far as not using it because of array_count_values: my mistake. I read your code wrong.

Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

Nope, I'm afraid that's false. The array expression is only run once.

hmm... are you sure about that? I guess I don't see anything in the manual or user notes about it, but I coulda swore I remember seeing a thread here a while back that debated this, with benchmarks and stuff.

ignace · July 11, 2009

[a-z]+

D4rN n0 LEE+ 5P34k

.josh · July 11, 2009

[a-z]+

D4rN n0 LEE+ 5P34k

str_word_count() will not return that either. So if you want to include stuff like that, you'd have to regex it.

thebadbad · July 11, 2009

From the manual:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

So it's a bit more complex than the [a-z]+ regex.
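A quick side-by-side on a made-up phrase shows the difference for apostrophes and hyphens:

```php
<?php
// Made-up phrase with an apostrophe and a hyphenated word.
$str = "they're good and long-winded";

// Built-in: apostrophes and hyphens inside a word are kept.
$builtin = str_word_count($str, 1);

// Simple regex: only runs of a-z, so those words get split.
preg_match_all('~[a-z]+~', $str, $matches);
$regex = $matches[0];

print_r($builtin);
print_r($regex);
```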

Regarding the foreach question, I tested it with this:

<?php
function test($array) {
    sleep(1);
    return $array;
}
$array = range(0,9);
foreach (test($array) as $value) {
    echo $value;
}
?>

Execution took ~ 1 second. Maybe you're thinking about a for loop, where the second expression is evaluated at the beginning of each iteration.
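The same point can be checked without timing, by counting calls. This is a sketch; the $items closure and the counter are made up for the test:

```php
<?php
$calls = 0;

// Returns a three-item array and records every call.
$items = function () use (&$calls) {
    $calls++;
    return array(1, 2, 3);
};

// foreach evaluates the array expression once, before the first iteration.
foreach ($items() as $value) {
}
$foreachCalls = $calls;

// A for loop re-evaluates its condition before every iteration,
// including the final check that ends the loop.
$calls = 0;
for ($i = 0; $i < count($items()); $i++) {
}
$forCalls = $calls;

echo "foreach: $foreachCalls, for: $forCalls\n";
```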
