
[SOLVED] I need some help on how to parse out common phrases in a document.



I'm trying to parse out the most common 4-word, 3-word, and 2-word phrases from documents.  I've got gigabytes of documents that I need to recursively parse through (that is, even though the data comes from different sources, it needs to be treated as if it came from a single source).  I've been able to parse out the most common single words, but I don't know how to efficiently parse out the most common multi-word combinations.

 

Any help would be appreciated.   

 

I was thinking about putting all the documents into a database (stripping all unnecessary punctuation, markup, etc.), then taking the first 4 words and searching the database for exact matches to come up with a count, then taking the 2nd, 3rd, 4th, and 5th words and searching again, and repeating this through the whole database... but this does not seem like the best way to do it.
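
A rough sketch of that sliding-window idea done in memory instead of through repeated database searches: read each file line by line, keep only the last few words as a moving window, and tally every full window in one array shared across all files. The name `countPhrasesInFiles` and its parameters are made up for illustration. It assumes the set of distinct phrases fits in memory (for a very large corpus the counts would have to be spilled to disk or a database), and it resets the window at the start of each file so that no phrase spans two documents.

```php
<?php
// Hypothetical helper: count $n-word phrases across many files without
// loading a whole file into memory at once.
function countPhrasesInFiles(array $filenames, $n = 2, $limit = 0) {
  $counts = array();
  foreach ($filenames as $filename) {
    if (!is_readable($filename)) {
      continue; // skip files that cannot be read
    }
    $handle = fopen($filename, 'r');
    $window = array(); // the most recent words, at most $n of them
    while (($line = fgets($handle)) !== false) {
      // a word is any run of letters, digits or apostrophes
      preg_match_all("~[a-z0-9']+~", strtolower($line), $m);
      foreach ($m[0] as $word) {
        $word = trim($word, "'");
        if ($word === '') {
          continue; // a lone apostrophe is not a word
        }
        $window[] = $word;
        if (count($window) > $n) {
          array_shift($window); // slide the window forward by one word
        }
        if (count($window) === $n) {
          $phrase = implode(' ', $window);
          $counts[$phrase] = isset($counts[$phrase]) ? $counts[$phrase] + 1 : 1;
        }
      }
    }
    fclose($handle);
  }
  // most frequent phrases first
  arsort($counts);
  return ($limit > 0) ? array_slice($counts, 0, $limit, true) : $counts;
}
```

Because the window is carried from one line to the next, phrases that wrap across a line break are still counted. Calling it once each with 4, 3 and 2 as `$n` would give the three lists asked for.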

 

 

 

 

class WordCounter
{
  const ASC = 1;
  const DESC = 2;

  private $words;

  function __construct($filename)
  {
    $file_content = file_get_contents($filename);
    $this->words = array_count_values(str_word_count(strtolower($file_content), 1));
  }

  public function count($order)
  {
    if ($order == self::ASC)
      asort($this->words);
    else if ($order == self::DESC)
      arsort($this->words);

    foreach ($this->words as $key => $val)
      echo $key . " = " . $val . "<br/>";
  }
}

 

 

Thanks

 

-Brad

Link to comment
Share on other sites

function getPhraseCount($string, $numWords=1, $limit=0) {
  // make case-insensitive
  $string = strtolower($string);
  // get all words. Assume any 1 or more letter, number or ' in a row is a word 
  preg_match_all('~[a-z0-9\']+~',$string,$words);
  $words = $words[0];
  // foreach word...
  foreach($words as $k => $v) {
    // remove single quotes that are by themselves or wrapped around the word
    $words[$k] = trim($words[$k],"'");
  } // end foreach $words
  // remove any empty elements produced from ' trimming
  // (strlen as the callback keeps the word "0", which a plain array_filter would drop)
  $words = array_filter($words, 'strlen');
  // reset array keys
  $words = array_values($words);
  // start with an empty list so short input doesn't leave $phrases undefined
  $phrases = array();
  // foreach word...
  foreach ($words as $k => $word) {
    // if there are enough words from the current word on to make a $numWords length phrase
    // (the last word of the phrase sits at index $k+$numWords-1)...
    if (isset($words[$k+$numWords-1])) {
      // add the phrase to list of phrases
      $phrases[] = implode(' ',array_slice($words,$k,$numWords));
    } // end if isset
  } // end foreach $words
  // create an array of phrases => count
  $x = array_count_values($phrases);
  // reverse sort it (preserving keys, since the keys are the phrases)
  arsort($x);
  // if limit is specified, return only $limit phrases. otherwise, return all of them
  // (4th argument true keeps purely numeric words as keys instead of renumbering them)
  return ($limit > 0) ? array_slice($x,0,$limit,true) : $x;
} // end getPhraseCount

//examples:

getPhraseCount($string); // return full list of single keyword count 
getPhraseCount($string,2); // return full list of 2 word phrase count
getPhraseCount($string,2,10); // return top 10 list of 2 word phrase count

 

Description:

 

Okay, so basically this function takes the string and returns a phrase => count associative array.  If you only pass it the string, it defaults to counting individual words and returns all of them in descending order of count.  The optional 2nd argument lets you specify how many words are in a phrase.  So if you pass 2 as the 2nd argument, it goes through the text and, for each word, takes that word plus the word after it and counts how many times that 2-word phrase occurs, returning the list in descending order.  If the optional 3rd argument is used, only the top x phrases are returned, so 10 would return the 10 most frequent phrases.
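
To make the return value concrete, here is a small self-contained sketch. `ngramCounts` is a hypothetical, stripped-down stand-in written only so the example runs on its own (same word pattern, but no apostrophe trimming and no limit argument); it is not the function above.

```php
<?php
// Hypothetical stand-in: count every run of $n consecutive words.
function ngramCounts($string, $n) {
  preg_match_all("~[a-z0-9']+~", strtolower($string), $m);
  $words = $m[0];
  $phrases = array();
  // the last phrase starts at index count($words) - $n
  for ($i = 0; $i + $n <= count($words); $i++) {
    $phrases[] = implode(' ', array_slice($words, $i, $n));
  }
  $counts = array_count_values($phrases);
  arsort($counts); // highest count first, keys (the phrases) preserved
  return $counts;
}

$text = "the cat sat on the mat and the cat sat down";
// run the 4, 3 and 2 word passes asked about in the first post
foreach (array(4, 3, 2) as $n) {
  echo $n . " word phrases:\n";
  print_r(array_slice(ngramCounts($text, $n), 0, 3, true));
}
```

In this sample text "the cat sat" is counted twice in the 3-word pass, "the cat" and "cat sat" are each counted twice in the 2-word pass, and no 4-word phrase repeats.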

 

Limitations:

 

- hyphenated words are not matched as single words: the hyphen acts as a separator, so "well-known" is counted as "well" and "known".

 

- case-insensitive: everything is lowercased before counting.

 

- assumes $string is "human" readable text.  In other words, if you pass it the file_get_contents of some web page, you should probably run strip_tags on it first, and use some regex to remove everything between script tags, etc.
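
For the last limitation, one possible cleanup pass before counting might look like the sketch below. `cleanHtmlForCounting` is a hypothetical helper name. The regexes are a rough approximation and not a real HTML parser, so treat this as a starting point. It replaces tags with spaces rather than calling strip_tags, because strip_tags would glue "one" and "two" together in `<p>one</p><p>two</p>`. The second function is an assumed variant of the word pattern that keeps hyphenated words together, in case the first limitation matters for your text.

```php
<?php
// Hypothetical helper: reduce an HTML page to plain text for phrase counting.
function cleanHtmlForCounting($html) {
  // drop <script> and <style> blocks including their contents
  $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', ' ', $html);
  // replace remaining tags with a space so adjacent words don't run together
  $text = preg_replace('~<[^>]+>~', ' ', $text);
  // turn entities such as &amp; back into characters
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // collapse runs of whitespace into single spaces
  return trim(preg_replace('~\s+~', ' ', $text));
}

// Hypothetical variant of the word pattern that keeps "well-known" as one word.
function splitWordsKeepHyphens($text) {
  preg_match_all("~[a-z0-9']+(?:-[a-z0-9']+)*~", strtolower($text), $m);
  return $m[0];
}
```

A hyphen only joins words when it sits directly between two word characters, so a free-standing dash used as punctuation is still treated as a separator.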

 

 

 

 

 
