Thanks so far, but what I want is to count how many times each word occurs in the array, and then output the top 5 words.

 

So far it's giving me a page of gobbledygook :(. One of the outputs is [4917] => information, but "information" doesn't occur 4917 times :S.

Here's how you could use the code ignace provided:

 

<?php
$var_str = 'Words appear here words words here again. Some more words to fill up the list.';
$var = explode(' ', $var_str);
$words = array();
foreach ($var as $word) {
	if (!array_key_exists($word, $words)) {
		$words[$word] = 1;
	} else {
		++$words[$word];
	}
}
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
	echo "$word ($count)<br />";
}
?>

 

Note that the code is case-sensitive.

 

But doing a simple explode() on the spaces doesn't take punctuation marks into account. Consider this example, using regular expressions:

 

<?php
$var_str = 'Words appear here: Words, words, here again. Some more words to fill up the list!';
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
echo "$word ($count)<br />";
}
?>

Thanks for your help! It is getting closer to what I wanted now.

 

The purpose is to determine the subject of a page based on which word occurs most often in its content. I've implemented strip_tags, but now I would like to exclude common words like "the", "if", "is", etc.

 

Could you advise me on how to do this? Much appreciated!  :)

Hi, I have the following, but it's returning a blank page:

 


$commons = array("alot of words!!") ;

$stopwordcount = count($commons) ;

$site = Hunt($target, $useragent) ; //puts the site content into a variable string.

$headchop = explode("</head>", $site) ; //Split header and body.

$wordz = strip_tags($headchop[1]) ;

$wcount = str_word_count($words) ;

for($i = 0; $i < $wcount; $i++) {
for($x = 0; $x < $stopwordcount; $x++) {
	$words = ereg_replace("$commons[$x]", "", $wordz) ;
}
}

Sorry, later on it does:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

But now I am trying:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words= array_keys($words) ;

$words = array_diff($words, $commons) ;

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
   echo "$word ($count)<br />";
}

No success yet though :(

PS: How do I get it to filter out any array item that is a number rather than a word? Thanks for your help so far!
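
On the number question, here is a minimal sketch of one way to do it, assuming the filtering happens on the list of matched tokens before they are counted. The sample text and the variable name $onlyWords are made up for illustration. Since \w also matches digits, the regex above lets numeric tokens through; ctype_alpha() keeps only tokens that consist entirely of letters.

```php
<?php
// Hypothetical sample input; in the thread this would be the stripped page text.
$text = 'PHP 5 was released in 2004 and PHP 7 in 2015';

// \w matches letters, digits and underscores, so "5" and "2004" are captured too.
preg_match_all('~\b\w+\b~', strtolower($text), $matches);

// Keep only tokens made up entirely of letters, then reindex the list.
$onlyWords = array_values(array_filter($matches[0], 'ctype_alpha'));

print_r($onlyWords);
```

Filtering before array_count_values() is the simpler route, because once the words become array keys PHP converts numeric strings such as "2004" into integer keys.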

Better solution using str_word_count() (my new favorite function for the day):

 

<?php
$str = 'Words appear here: Words, words, here again. Some more words to fill up the list! The sentences, they\'re good and long-winded.';
//grab words
$words = str_word_count($str, 1);
//transform all words to lowercase
$words = array_map('strtolower', $words);
//remove common words
$commons = array('the'); //etc.
$words = array_diff($words, $commons);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
echo "$word ($count)<br />";
}
?>

Couple things with your code, thebadbad:

 

Would be more efficient to strtolower the original string instead of using array_map.  Also, I don't really see the point in using str_word_count when you're already using array_count_values.  Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

 

My take:

 

$page = file_get_contents($url); // get the page contents ($url is assumed to be set)
$page = preg_replace('~<script[^>]*>.*?</script>~is','',$page); // remove scripts first, while the tags still exist
$page = strip_tags($page); // remove the remaining tags
$page = strtolower($page); // make case-insensitive
preg_match_all('~[a-z]+~',$page,$words); // get array of words
$words = array_diff($words[0],$commonWords); // filter out common words ($commonWords is assumed to be a lowercase array)
$words = array_count_values($words); // count occurrences
arsort($words); // sort highest to lowest
$words = array_slice($words,0,5); // get top 5

 

The only thing still lacking is validating the words as real words.  Like for instance, if the page has &nbsp;, guess what, that's going to be counted as a word.  I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate).  Better solution would be to compare them against a list of real words.
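
A minimal sketch of the "compare against a list of real words" idea, assuming the counts have already been produced by array_count_values() and arsort(). The $dictionary array and the sample counts are hypothetical; a real list would probably be loaded from a word-list file.

```php
<?php
// Hypothetical dictionary of accepted words.
$dictionary = array('words', 'page', 'subject', 'list');

// Hypothetical counts, sorted highest first; "nbsp" and "amp" are entity debris.
$counts = array('nbsp' => 12, 'words' => 7, 'page' => 4, 'amp' => 3, 'subject' => 2);

// Flip the dictionary so its words become keys, then keep only the
// entries of $counts whose key also appears in the dictionary.
$valid = array_intersect_key($counts, array_flip($dictionary));

print_r($valid);
```

array_intersect_key() keeps the order of the first array, so the result is still sorted by frequency and can go straight into array_slice().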

 

Would be more efficient to strtolower the original string instead of using array_map.

 

True.

 

Also, I don't really see the point in using str_word_count when you're already using array_count_values.

 

What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).

 

Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

 

Nope, I'm afraid that's false. The array expression is only run once.

 

The only thing still lacking is validating the words as real words.  Like for instance, if the page has &nbsp;, guess what, that's going to be counted as a word.  I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate).  Better solution would be to compare them against a list of real words.

 

You could get rid of any HTML entities with e.g. preg_replace('~&\S+;~', '', $str), but I agree that we would probably still end up with a lot of 'false' words.
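
One caveat with that pattern, shown as a small sketch: \S+ is greedy, so when two entities sit on the same whitespace-free run, everything between them is removed as well. The tighter pattern below is only an assumption about what counts as an entity (a name, a decimal reference or a hex reference), and the sample string is made up.

```php
<?php
// Hypothetical input with two entities and no real spaces.
$str = 'A&nbsp;big&nbsp;word';

// Greedy version: matches from the first "&" to the last ";" in the run.
$greedy = preg_replace('~&\S+;~', '', $str);

// Tighter version: match one entity at a time and replace it with a space,
// so the neighbouring words do not get glued together.
$tight = preg_replace('~&(?:[a-z][a-z0-9]*|#\d+|#x[0-9a-f]+);~i', ' ', $str);

echo $greedy, "\n"; // Aword
echo $tight, "\n";  // A big word
```

Another option would be html_entity_decode() before matching words, which turns the entities back into the characters they stand for instead of deleting them.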

I have tried it and swapped lots of things around, but no luck:

$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);

$words = array_diff($words, $commons) ;

$words = array_keys($words) ;

$nounsused = array_intersect($nouns, $words) ;

$nounsused = array_count_values($nounsused) ;

arsort($nounsused) ;

$minA = rand(0, 3) ;
$minB = rand(0, 3) ;

//output top 5
foreach (array_slice($words, $minA, 3) as $word => $counta) {
//  echo "$word ($count)<br />";
}

foreach (array_slice($nounsused, $minB, 3) as $nounused => $countb) {
//  echo "$nounused ($count)<br />";
}

$Most_Prom = rand(0, 2) ; $Most_Prom_Noun = rand(0, 2) ;

echo "says: " . $initfrags[array_rand($initfrags)] . " " . $word . " " . $midfrags[array_rand($midfrags)] . " " . $nounused . " " . $endfrags[array_rand($endfrags)] ;

I get this output:

 

says: I think that 2 is very good, but process which is great.

 

The 2 shouldn't be there, it should be one of the 3 most popular words instead.
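
Judging only from the snippet above, a likely cause is that array_keys() turns the counted array into a plain 0-indexed list, so in foreach (... as $word => $counta) the variable $word receives the numeric index and $counta receives the actual word. The sketch below reproduces this with hypothetical counts. It also looks like the array_diff($words, $commons) call runs on the counts rather than the words at that point, so it probably removes nothing.

```php
<?php
// Hypothetical counts, already sorted by arsort().
$counts = array('process' => 9, 'server' => 6, 'information' => 4, 'page' => 2);

// array_keys() discards the counts and returns a 0-indexed list of words.
$list = array_keys($counts);

foreach (array_slice($list, 1, 3) as $word => $counta) {
    // $word is the numeric index here, $counta is the word itself.
}
// After the loop, $word still holds the last index of the 3-item slice.
echo $word, ' / ', $counta, "\n"; // 2 / page

// Slicing the counted array instead keeps the words as keys.
$top = array_keys(array_slice($counts, 0, 3, true));
print_r($top);
```

Because the slice always has three items, the last index is always 2, which would explain why that number shows up in the sentence.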

What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).

str_word_count returns the same thing as the [a-z]+ regex.  As to whether it's more efficient than the regex...not sure about that.  It's a pretty simple regex. I can't really imagine how str_word_count's internal regex could get any simpler.  I don't really feel like benchmarking it, but I'll buy into str_word_count being 'easier' for people to understand.  As far as not using it because of array_count_values: my mistake. I read your code wrong.

 

Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.

 

Nope, I'm afraid that's false. The array expression is only run once.

 

hmm... are you sure about that? I guess I don't see anything in the manual or user notes about it, but I coulda swore I remember seeing a thread here a while back that debated this, with benchmarks and stuff. 

 

 

From the manual:

 

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

 

So it's a bit more complex than the [a-z]+ regex.
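
A quick sketch of the difference, using part of the test sentence from earlier in the thread. The variable names are just for illustration.

```php
<?php
$str = "they're good and long-winded";

// str_word_count() with format 1 returns the words as an array and
// treats an inner apostrophe or hyphen as part of the word.
$builtin = str_word_count($str, 1);

// The plain [a-z]+ regex splits at every non-letter character.
preg_match_all('~[a-z]+~', $str, $m);
$regex = $m[0];

print_r($builtin); // they're, good, and, long-winded
print_r($regex);   // they, re, good, and, long, winded
```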

 

Regarding the foreach question, I tested it with this:

 

<?php
function test($array) {
	sleep(1);
	return $array;
}
$array = range(0,9);
foreach (test($array) as $value) {
	echo $value;
}
?>

 

Execution took ~ 1 second. Maybe you're thinking about a for loop, where the second expression is evaluated at the beginning of each iteration.
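
The same comparison can be made without sleep() by counting calls. This is only a sketch; the helper items() and the counter variables are made up for the example.

```php
<?php
// Hypothetical helper that records how often it is called.
function items(&$calls) {
    $calls++;
    return array('a', 'b', 'c');
}

// foreach evaluates its array expression once, before the first iteration.
$calls = 0;
foreach (items($calls) as $value) {
    // loop body not needed for the test
}
$foreachCalls = $calls;

// for re-evaluates its condition before every iteration,
// including the final check that ends the loop.
$calls = 0;
for ($i = 0; $i < count(items($calls)); $i++) {
    // loop body not needed for the test
}
$forCalls = $calls;

echo "foreach: $foreachCalls, for: $forCalls\n"; // foreach: 1, for: 4
```

So assigning the array_slice() result to a variable first is a matter of readability in a foreach; it only saves work in a for loop.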
