imperium2335 Posted July 11, 2009
Hi, I have downloaded a webpage using the file_get_contents() function and split it into an array using explode(" ", $var);. How do I now make it list, say, the top 5 most common words? Cheers.
Link to thread: https://forums.phpfreaks.com/topic/165600-find-most-frequent-word/
ignace Posted July 11, 2009
$words = array();
foreach ($var as $word) {
    if (!array_key_exists($word, $words)) {
        $words[$word] = 1;
    } else {
        ++$words[$word];
    }
}
imperium2335 Posted July 11, 2009
If I echo $words[0] etc., there is nothing there :S What is the output from this?
ignace Posted July 11, 2009
Quote: if i echo $words[0] etc there is nothing there :S What is the output from this?
That is because $words[0] doesn't exist: the array is keyed by the words themselves, not by number. Do print_r($words); to see what it contains.
imperium2335 Posted July 11, 2009
Thanks so far, but what I want is for it to count how many times a word occurs in the array, and then to output the top 5 words. So far it's giving me a page of gobbledygook. One of the outputs is [4917] => information, but "information" doesn't occur 4917 times :S.
.josh Posted July 11, 2009
array_count_values
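For readers unfamiliar with the function named above, here is a minimal sketch of what array_count_values() does. The sample sentence is made up for illustration; $var stands in for the exploded page content from the first post.

```php
<?php
// Hypothetical sample input; any list of words would do.
$var = explode(' ', 'here words words here words again');

// array_count_values() maps each distinct value to its number of occurrences,
// so the resulting array is keyed by word, not by position.
$counts = array_count_values($var);

print_r($counts);
```

The result is keyed by the words themselves, which is also why $words[0] was empty in the earlier attempt.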
thebadbad Posted July 11, 2009
Here's how you could use the code ignace provided:
<?php
$var_str = 'Words appear here words words here again. Some more words to fill up the list.';
$var = explode(' ', $var_str);
$words = array();
foreach ($var as $word) {
    if (!array_key_exists($word, $words)) {
        $words[$word] = 1;
    } else {
        ++$words[$word];
    }
}
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>
Note that the code is case-sensitive. But doing a simple explode() on the spaces doesn't take punctuation marks into account. Consider this example, using regular expressions:
<?php
$var_str = 'Words appear here: Words, words, here again. Some more words to fill up the list!';
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>
imperium2335 Posted July 11, 2009
Thanks for your help! It is getting closer to what I wanted now. The purpose is to determine the subject of a page based on which word occurs the most in the code. I've implemented strip_tags, but now I would like to exclude common words like "the", "if", "is", etc. Could you advise me on how to do this? Much appreciated!
.josh Posted July 11, 2009
Make an array of the common words you want filtered out, use array_keys to grab the words from the array_count_values result, and use array_diff.
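A rough sketch of that suggestion follows, with a made-up three-word stop list and sample input. The first variant is the array_keys plus array_diff route as described; note that it leaves you with the surviving words only. The second variant is a swapped-in alternative (array_diff_key with array_flip), shown only as one possible way to keep the counts attached.

```php
<?php
// Hypothetical stop-word list and sample words, for illustration only.
$commons = array('the', 'if', 'is');
$words   = array('the', 'cat', 'is', 'the', 'cat', 'if', 'dog');

$counts = array_count_values($words);   // word => count

// Variant 1 (as described above): diff the keys against the stop words.
// This yields the surviving words, without their counts.
$kept = array_diff(array_keys($counts), $commons);

// Variant 2 (an alternative, not from the thread): remove the stop words
// by key so the counts stay attached to the remaining words.
$filtered = array_diff_key($counts, array_flip($commons));
arsort($filtered);

print_r($kept);
print_r($filtered);
```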
thebadbad Posted July 11, 2009
Found this list of the 500 most common English words: http://www.world-english.org/english500.htm . If you want I can quickly put them in an array for you (nobody wants to manually enter all that).
imperium2335 Posted July 11, 2009
Hi, I have the following but it's returning a blank page:
$commons = array("alot of words!!") ;
$stopwordcount = count($commons) ;
$site = Hunt($target, $useragent) ; //puts the site content into a variable string.
$headchop = explode("</head>", $site) ; //Split header and body.
$wordz = strip_tags($headchop[1]) ;
$wcount = str_word_count($words) ;
for($i = 0; $i < $wcount; $i++) {
    for($x = 0; $x < $stopwordcount; $x++) {
        $words = ereg_replace("$commons[$x]", "", $wordz) ;
    }
}
thebadbad Posted July 11, 2009
LOL, didn't know about the str_word_count() function. Clever (if you use it right). Your script isn't told to output anything = a blank page. Also, you should use the method outlined by CV. Would be much more efficient.
imperium2335 Posted July 11, 2009
Sorry, later on it does:
$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
But now I am trying:
$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);
$words= array_keys($words) ;
$words = array_diff($words, $commons) ;
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
No success yet, though.
PS: how do I get it to filter out any array item that is a number and not a string? Thanks for your help so far!
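On the PS about numbers: one possible approach, sketched here with a made-up token list, is to drop numeric tokens before counting. The names $tokens and $wordsOnly are hypothetical. Filtering early matters because PHP converts numeric-string array keys such as '42' into integer keys, which makes them harder to tell apart after array_count_values().

```php
<?php
// Hypothetical token list, similar to what preg_match_all('~\b\w+\b~', ...)
// could return for a page that contains years or other numbers.
$tokens = array('php', '2009', 'words', '42', 'php');

// Keep only the tokens that are not numeric strings.
$wordsOnly = array();
foreach ($tokens as $token) {
    if (!is_numeric($token)) {
        $wordsOnly[] = $token;
    }
}

$counts = array_count_values($wordsOnly);
arsort($counts);

print_r($counts);
```

Another option is to match letters only in the first place, for example with a pattern such as ~[a-z]+~ on a lowercased string, since \w also matches digits and underscores.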
.josh Posted July 11, 2009
You should be using $keys as the first argument in array_diff. And I assume that $commons is a real array of common words instead of $commons = array("alot of words!!") ; right?
imperium2335 Posted July 11, 2009
Hi, yeah, sorry, I have changed it around. Yes, it contains about 500 words; I thought it best not to spam the forum with all that.
thebadbad Posted July 11, 2009
Better solution using str_word_count() (my new favorite function for the day):
<?php
$str = 'Words appear here: Words, words, here again. Some more words to fill up the list! The sentences, they\'re good and long-winded.';
//grab words
$words = str_word_count($str, 1);
//transform all words to lowercase
$words = array_map('strtolower', $words);
//remove common words
$commons = array('the'); //etc.
$words = array_diff($words, $commons);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);
//output top 5
foreach (array_slice($words, 0, 5) as $word => $count) {
    echo "$word ($count)<br />";
}
?>
.josh Posted July 11, 2009
Couple things with your code, thebadbad: Would be more efficient to strtolower the original string instead of using array_map. Also, I don't really see the point in using str_word_count when you're already using array_count_values. Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.
My take:
$page = file_get_contents($url); // get the page contents
$page = strtolower($page); // make case-insensitive
$page = preg_replace('~<script[^>]*>.*?</script>~s', '', $page); // remove any scripts while their tags are still present
$page = strip_tags($page); // remove the remaining tags
preg_match_all('~[a-z]+~', $page, $words); // get array of words
$words = array_diff($words[0], $commonWords); // filter out common words ($commonWords is your stop-word array)
$words = array_count_values($words); // count occurrences
arsort($words); // sort highest to lowest
$words = array_slice($words, 0, 5); // get top 5
The only thing still lacking is validating the words as real words. Like for instance, if the page has guess what, that's going to be counted as a word. I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate). Better solution would be to compare them against a list of real words.
thebadbad Posted July 11, 2009
Quote: Would be more efficient to strtolower the original string instead of using array_map.
True.
Quote: Also, I don't really see the point in using str_word_count when you're already using array_count_values.
What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).
Quote: Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.
Nope, I'm afraid that's false. The array expression is only run once.
Quote: The only thing still lacking is validating the words as real words. Like for instance, if the page has guess what, that's going to be counted as a word. I suppose you can mostly get around that with some lookaround in the preg_match_all (would be more efficient but less accurate). Better solution would be to compare them against a list of real words.
You could get rid of any HTML entities with e.g. preg_replace('~&\S+;~', '', $str), but I agree that we would probably still end up with a lot of 'false' words.
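To make the difference between the two tokenizers concrete, here is a small comparison on a made-up sentence. It assumes plain ASCII text and the default locale; $builtin and $matches are names chosen for this sketch.

```php
<?php
// Hypothetical sample sentence containing an apostrophe and a hyphen.
$str = strtolower("They're good and long-winded.");

// str_word_count() with format 1 returns the words it finds; apostrophes and
// hyphens inside a word are kept as part of that word.
$builtin = str_word_count($str, 1);

// The plain letters-only regex splits at every non-letter character.
preg_match_all('~[a-z]+~', $str, $matches);

print_r($builtin);
print_r($matches[0]);
```

The built-in keeps "they're" and "long-winded" whole, while the regex breaks them into "they", "re", "long" and "winded". Which behaviour is preferable depends on how the stop-word list is written.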
imperium2335 Posted July 11, 2009
Is there any way to collate the keys together? I have the 3 most frequent words returned, which is great, but how can I make it so that $word[0] is the most frequent, [1] is the second most frequent, etc.?
thebadbad Posted July 11, 2009
After sorting array_count_values($words) with arsort(), you can simply do array_keys($words)
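A minimal sketch of that step, using a made-up word list: once the counts are sorted, array_keys() gives a plain numerically indexed list in which position 0 holds the most frequent word. $ranked is a name chosen for this example.

```php
<?php
// Hypothetical input words.
$words = array('words', 'here', 'words', 'again', 'here', 'words');

$counts = array_count_values($words);   // word => count
arsort($counts);                        // highest count first, keys preserved

// array_keys() drops the counts and re-indexes from 0.
$ranked = array_keys($counts);

echo $ranked[0];   // most frequent word
echo "\n";
echo $ranked[1];   // second most frequent word
```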
imperium2335 Posted July 11, 2009
Have tried it and swapped lots of things around, but no luck:
$var_str = $words ;
//grab all words, regardless of any bounding punctuation marks (and number of spaces between words)
preg_match_all('~\b\w+\b~', $var_str, $matches);
//transform all words to lowercase
$words = array_map('strtolower', $matches[0]);
//get word frequency
$words = array_count_values($words);
//sort array while keeping the keys
arsort($words);
$words = array_diff($words, $commons) ;
$words = array_keys($words) ;
$nounsused = array_intersect($nouns, $words) ;
$nounsused = array_count_values($nounsused) ;
arsort($nounsused) ;
$minA = rand(0, 3) ;
$minB = rand(0, 3) ;
//output top 5
foreach (array_slice($words, $minA, 3) as $word => $counta) {
    // echo "$word ($count)<br />";
}
foreach (array_slice($nounsused, $minB, 3) as $nounused => $countb) {
    // echo "$nounused ($count)<br />";
}
$Most_Prom = rand(0, 2) ;
$Most_Prom_Noun = rand(0, 2) ;
echo "says: " . $initfrags[array_rand($initfrags)] . " " . $word . " " . $midfrags[array_rand($midfrags)] . " " . $nounused . " " . $endfrags[array_rand($endfrags)] ;
I get this output:
says: I think that 2 is very good, but process which is great.
The 2 shouldn't be there; it should be one of the 3 most popular words instead.
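A likely explanation for the stray 2, offered as a sketch rather than a confirmed diagnosis: after array_keys(), the array is a plain list, so in foreach (... as $word => $counta) the variable $word receives the numeric position and $counta receives the actual word. When the loop ends, $word still holds the last position, which is 2 for a three-item slice. Separately, array_diff() in that snippet runs on the counts array, so it compares counts (not words) against $commons. The data below is made up, and $topThree and $picked are hypothetical names.

```php
<?php
// Hypothetical counts, already sorted from most to least frequent.
$counts = array('process' => 4, 'server' => 3, 'cache' => 2);
$ranked = array_keys($counts);   // array('process', 'server', 'cache')

// In a list, the foreach key is the position and the value is the word.
foreach (array_slice($ranked, 0, 3) as $position => $rankedWord) {
    // nothing to do here; the loop only demonstrates what each variable holds
}
// After the loop, $position is 2 and $rankedWord is 'cache'.

// One way to pick a random word out of the top three instead:
$topThree = array_slice($ranked, 0, 3);
$picked   = $topThree[array_rand($topThree)];

echo "position: $position, last word: $rankedWord, picked: $picked";
```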
.josh Posted July 11, 2009
Quote: What do you mean? str_word_count($str, 1) grabs the words (more accurate than the ~[a-z]+~ regex) and array_count_values() counts the frequency of those words. When there is a built in function, I'd rather use that than a regular expression (if the functionality doesn't need customizing).
str_word_count returns the same thing as the [a-z]+ regex. As to whether it's more efficient than the regex...not sure about that. It's a pretty simple regex. I can't really imagine how str_word_count's internal regex could get any simpler. I don't really feel like benchmarking it, but I'll buy into str_word_count being 'easier' for people to understand. As far as not using it because of array_count_values: my mistake. I read your code wrong.
Quote: Also, it's technically more efficient to assign the array slice first, then use the array in the foreach, because if you put it in the loop like that, it performs array_slice every single iteration.
Quote: Nope, I'm afraid that's false. The array expression is only run once.
Hmm... are you sure about that? I guess I don't see anything in the manual or user notes about it, but I coulda swore I remember seeing a thread here a while back that debated this, with benchmarks and stuff.
ignace Posted July 11, 2009
Quote: [a-z]+
D4rN n0 LEE+ 5P34k
.josh Posted July 11, 2009
Quote: [a-z]+
Quote: D4rN n0 LEE+ 5P34k
str_word_count() will not return that either. So if you want to include stuff like that, you'd have to regex it.
thebadbad Posted July 11, 2009
From the manual:
Quote: For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.
So it's a bit more complex than the [a-z]+ regex.
Regarding the foreach question, I tested it with this:
<?php
function test($array) {
    sleep(1);
    return $array;
}
$array = range(0, 9);
foreach (test($array) as $value) {
    echo $value;
}
?>
Execution took ~ 1 second. Maybe you're thinking about a for loop, where the second expression is evaluated at the beginning of each iteration.
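The same point can be checked without timing, by counting calls instead of sleeping. This is a sketch with hypothetical names ($calls, $countedItems); it contrasts foreach, which evaluates its array expression once, with a for loop whose condition is re-evaluated before every iteration.

```php
<?php
$calls = 0;

// Returns the array unchanged, but records each time it is called.
$countedItems = function (array $items) use (&$calls) {
    $calls++;
    return $items;
};

$items = range(0, 9);

// foreach evaluates the array expression a single time.
foreach ($countedItems($items) as $value) {
    // loop body intentionally empty
}
$foreachCalls = $calls;

// A for loop re-evaluates its condition before each iteration,
// including the final check that ends the loop.
$calls = 0;
for ($i = 0; $i < count($countedItems($items)); $i++) {
    // loop body intentionally empty
}
$forCalls = $calls;

echo "foreach: $foreachCalls call(s), for: $forCalls call(s)";
```

With ten items, the for loop's condition runs eleven times, so hoisting the call out of the loop only pays off in the for case.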