Cantfigureitout Posted December 16, 2007

Hi everyone,

I'm developing a little tool that analyzes content from a wide variety of sources; in other words, a lot of data. I need to find the most common paired and larger word groups.

For example, take: "Johny went to the store after which Johny went to buy gas and at the end of the day Johny went home!"

What I'm trying to achieve is to scan for groups of 2 words, groups of 3 words, etc. So in this case the result would be:

Most common groups of 2 words:
#1: "Johny went" (found 3 times)
#2: "went to" (found 2 times)
etc.

And the same thing for groups of 3 words, and possibly groups of 4 depending on how intensive it is.

I can either dump ALL the data into one huge variable containing all the text to be analyzed (around 100,000 words on average at the moment), or (and this might result in better groupings too) split the content whenever a . , ; ? or ! occurs and store the pieces in an array (probably faster, hehe).

Anyway, what do you guys think is the best way to then analyze the contents and count the word groupings?
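(The post doesn't specify a language, so here is a minimal Python sketch of the counting idea: split the text on sentence punctuation as suggested above, then slide a window of n words over each piece and tally the groups with a counter. The function name `top_ngrams` and the result limit are my own choices for illustration.)

```python
import re
from collections import Counter

def top_ngrams(text, n, limit=10):
    """Return the `limit` most common groups of `n` consecutive words.

    The text is split on . , ; ? ! first, so a word group never
    spans a sentence boundary (as suggested in the post).
    """
    counts = Counter()
    for sentence in re.split(r"[.,;?!]", text):
        words = sentence.split()
        # Slide a window of n words over the sentence and tally each group.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(limit)

sample = ("Johny went to the store after which Johny went to buy gas "
          "and at the end of the day Johny went home!")
print(top_ngrams(sample, 2, 3))
# The two most common pairs are ("Johny went", 3) and ("went to", 2).
```

For 100,000 words this stays roughly linear in the text size for each group length, since every window position is just one dictionary increment.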
Cantfigureitout (author) Posted December 16, 2007

By the way, my backup plan is to grab the top 500 most common SINGLE words, run them through the whole 100,000-word text, grab the words before and after each occurrence, and come up with the word groups that way. But that's a backup in case my question above turns out to be too intensive or impossible (then again, nothing is impossible).
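(A sketch of that backup plan, again in Python since the thread names no language: first tally single words, then count only the pairs that involve one of the top words and its immediate neighbour. The function name `neighbor_pairs` and the `top_k` parameter are illustrative choices, not anything from the post.)

```python
import re
from collections import Counter

def neighbor_pairs(text, top_k=500):
    """Backup approach: find the `top_k` most common single words,
    then count only the adjacent pairs that contain one of them.
    Returns a Counter mapping (word, next_word) tuples to counts.
    """
    words = re.findall(r"\w+", text)
    common = {w for w, _ in Counter(words).most_common(top_k)}
    pairs = Counter()
    for i in range(len(words) - 1):
        # Keep a pair only if either member is one of the common words.
        if words[i] in common or words[i + 1] in common:
            pairs[(words[i], words[i + 1])] += 1
    return pairs

sample = ("Johny went to the store after which Johny went to buy gas "
          "and at the end of the day Johny went home!")
print(neighbor_pairs(sample, top_k=3).most_common(2))
```

Note this filters the pair counts rather than reducing the work much: you still scan every adjacent pair once, so the full n-gram count above is unlikely to be meaningfully more expensive.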