btray77 Posted August 13, 2009

I'm trying to parse out the most common 4-word, 3-word, and 2-word phrases from documents. I've got gigs of documents that I need to parse through recursively (i.e., even though the data is from different sources, it needs to be treated as if it came from a single source).

I've been able to parse out the most common single words, but I don't know how to efficiently parse out the most common multiple-word combinations. Any help would be appreciated.

I was thinking about putting all the documents into a database file (stripping all unnecessary punctuation, markup, etc.), then taking the first 4 words and searching the database for exact matches to come up with a count, then taking the 2nd, 3rd, 4th, and 5th words and searching again, and repeating this through the whole database... but this does not seem like the best way to do it.

    class WordCounter {
        const ASC = 1;
        const DESC = 2;

        private $words;

        function __construct($filename) {
            $file_content = file_get_contents($filename);
            // word => number of occurrences (case-insensitive)
            $this->words = array_count_values(str_word_count(strtolower($file_content), 1));
        }

        public function count($order) {
            if ($order == self::ASC)
                asort($this->words);
            else if ($order == self::DESC)
                arsort($this->words);

            foreach ($this->words as $key => $val)
                echo $key . " = " . $val . "<br/>";
        }
    }

Thanks
-Brad
.josh Posted August 13, 2009

    function getPhraseCount($string, $numWords = 1, $limit = 0) {
        // make case-insensitive
        $string = strtolower($string);
        // get all words. Assume any 1 or more letter, number or ' in a row is a word
        preg_match_all('~[a-z0-9\']+~', $string, $words);
        $words = $words[0];
        // foreach word...
        foreach ($words as $k => $v) {
            // remove single quotes that are by themselves or wrapped around the word
            $words[$k] = trim($words[$k], "'");
        } // end foreach $words
        // remove any empty elements produced from ' trimming (strlen keeps the word "0")
        $words = array_filter($words, 'strlen');
        // reset array keys
        $words = array_values($words);
        // start with an empty list so array_count_values() always gets an array
        $phrases = array();
        // foreach word...
        foreach ($words as $k => $word) {
            // if there are enough words, counting the current one, to make a $numWords length phrase...
            if (isset($words[$k + $numWords - 1])) {
                // add the phrase to list of phrases
                $phrases[] = implode(' ', array_slice($words, $k, $numWords));
            } // end if isset
        } // end foreach $words
        // create an array of phrases => count
        $x = array_count_values($phrases);
        // reverse sort it (preserving keys, since the keys are the phrases)
        arsort($x);
        // if limit is specified, return only $limit phrases. otherwise, return all of them
        return ($limit > 0) ? array_slice($x, 0, $limit, true) : $x;
    } // end getPhraseCount

    // examples:
    getPhraseCount($string);        // return full list of single keyword count
    getPhraseCount($string, 2);     // return full list of 2 word phrase count
    getPhraseCount($string, 2, 10); // return top 10 list of 2 word phrase count

Description:

Okay, so basically this function takes the string and returns a phrase => count associative array. If you only pass it the string, it defaults to counting individual words and returns all of them in descending order of count. The optional 2nd argument lets you specify how many words are in a phrase. So if you pass 2 as the 2nd argument, it goes through and, for each word, takes that word and the word after it and counts how many times that 2-word phrase occurs, returning the list in descending order. If the optional 3rd argument is used, it returns only the top x phrases, so 10 would return the 10 most frequent phrases.

Limitations:
- hyphenated words are not matched as one word (the hyphen splits them).
- case-insensitive.
- assumes $string is "human"-readable text. In other words, if you were to pass a file_get_contents of some webpage to it, you should probably strip_tags first, as well as do some regex to remove the stuff between script tags, etc.
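On that last limitation, here is a minimal sketch of the cleanup step, assuming PHP 7+ and UTF-8 input. The name `htmlToPlainText` is made up for illustration (it is not a built-in and not from the post above), and the regex is a rough filter for script/style blocks, not a real HTML parser.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): reduce an HTML
// page to plain text before counting phrases.
function htmlToPlainText(string $html): string
{
    // Drop <script> and <style> elements together with their contents;
    // strip_tags() alone would remove the tags but keep the code inside them.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', ' ', $html);

    // Put a space in front of every remaining tag so that words in adjacent
    // elements (e.g. "</p><p>") do not fuse together once tags are removed.
    $text = str_replace('<', ' <', $text);

    // Remove the remaining tags.
    $text = strip_tags($text);

    // Decode entities such as &amp; and &#39; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('~\s+~', ' ', $text));
}

$html = '<html><head><script>var x = "ignore me";</script></head>'
      . '<body><p>Hello <b>big</b> world</p><p>Bye</p></body></html>';
echo htmlToPlainText($html), "\n";
```

The returned string can then be passed as `$string` to a phrase counter. For badly broken markup, a DOM-based approach may hold up better than a regex.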
btray77 (Author) Posted August 13, 2009

Thank you for the quick reply! I believe that will work for me. Now to figure out how to mark this as solved...

Thanks again
-Brad
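For the "gigs of documents" part of the original question, one possible approach is to stream each file line by line and keep a sliding window of the last few words, so that no file has to be loaded whole. The sketch below assumes PHP 7+ and plain-text files; `countPhrasesInDir` is a hypothetical name, and the tokenizing rules simply mirror the regex used earlier in the thread. Counts from all files go into one table, but phrases are not allowed to span two files. The table of distinct phrases still lives in memory, so a very large corpus may need a database or some pruning instead.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): count
// $numWords-word phrases across every file under $dir, merging the counts
// as if all files were a single source.
function countPhrasesInDir(string $dir, int $numWords = 2, int $limit = 0): array
{
    $counts = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $handle = fopen($file->getPathname(), 'r');
        if ($handle === false) {
            continue; // unreadable file: skip it
        }
        // Last words seen; reset per file so phrases never span two files.
        $window = [];

        while (($line = fgets($handle)) !== false) {
            // Same word definition as in the thread: letters, digits and '
            preg_match_all("~[a-z0-9']+~", strtolower($line), $m);
            foreach ($m[0] as $word) {
                $word = trim($word, "'");
                if ($word === '') {
                    continue;
                }
                // Slide the window forward by one word.
                $window[] = $word;
                if (count($window) > $numWords) {
                    array_shift($window);
                }
                // A full window is one occurrence of a phrase.
                if (count($window) === $numWords) {
                    $phrase = implode(' ', $window);
                    $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
                }
            }
        }
        fclose($handle);
    }

    // Most frequent first, keeping phrase => count keys.
    arsort($counts);
    return $limit > 0 ? array_slice($counts, 0, $limit, true) : $counts;
}

// Example call (the ./docs path is made up):
// print_r(countPhrasesInDir(__DIR__ . '/docs', 3, 10));
```

Because the window carries over from one line to the next, phrases that wrap across a line break inside a file are still counted.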