
Grouping words


Cantfigureitout


Hi everyone,

 

I'm developing a little tool that analyzes content from a wide variety of sources; in other words, a lot of data.

 

I need to find the most common word groups: pairs of words and longer groups.

 

For example:

 

"Johny went to the store after which Johny went to buy gas and at the end of the day Johny went home!"

 

What I'm trying to achieve is to scan for groups of 2 words, groups of 3 words, etc., so in this case it would result in:

 

Most common groups of 2 words:

#1: "Johny went"  (found 3 times)

#2: "went to"  (found 2 times)

etc.

And the same thing for groups of 3 words, and possibly 4, depending on how intensive it is.
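A minimal sketch of the counting described above, written in Python purely for illustration. The function name `count_word_groups` and the rule for what counts as a word (runs of letters, digits, and apostrophes, case-sensitive) are assumptions, not something the post specifies.

```python
import re
from collections import Counter

def count_word_groups(text, n):
    """Count every run of n consecutive words in text (case-sensitive)."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes,
    # so trailing punctuation such as "!" is dropped.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Slide a window of n words across the list, one position at a time.
    groups = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(groups)

sample = ("Johny went to the store after which Johny went to buy gas "
          "and at the end of the day Johny went home!")

pair_counts = count_word_groups(sample, 2)
triple_counts = count_word_groups(sample, 3)

# Two most frequent pairs, highest count first.
print(pair_counts.most_common(2))
```

The same function covers groups of 3 or 4 words by changing `n`. Each call makes one pass over the word list, so roughly 100,000 words should not be a heavy workload, though memory grows with the number of distinct groups.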

 

I can either dump ALL the data into one huge variable containing all the text to be analyzed (around 100,000 words on average at the moment)

Or

(and this might result in better groupings too) split the content whenever a . or , or ; or ? or ! occurs and store the pieces in an array (probably faster hehe)
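The splitting option could look something like the sketch below, again in Python for illustration only. The name `count_groups_by_segment` and the short test sentence are hypothetical. The point it shows is that once the text is cut at `.`, `,`, `;`, `?` and `!`, no word group can straddle a punctuation mark.

```python
import re
from collections import Counter

def count_groups_by_segment(text, n):
    """Count n-word groups, never letting a group cross . , ; ? or !"""
    # Cut the text at each of the punctuation marks named in the post.
    segments = re.split(r"[.,;?!]", text)
    counts = Counter()
    for segment in segments:
        words = segment.split()
        # Segments shorter than n words contribute nothing.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

segmented = count_groups_by_segment(
    "Johny went home. Johny went out, then Johny slept!", 2)
print(segmented.most_common(3))
```

Here "home Johny" is never counted, because the full stop separates the two words. Whether that gives better groupings depends on the content, so it may be worth comparing both variants on real data.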

 

Anyway, what do you guys think is the best way to then analyze the contents and count the word groupings?

 


By the way, my backup plan is to grab the top 500 most common SINGLE words, run them through the whole 100,000-word content, grab the words before and after each occurrence, and come up with the word groups that way. But that's a backup in case my above question turns out to be too intensive or impossible (then again, nothing is impossible).
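For comparison, the backup plan might be sketched like this (Python, illustrative only). `neighbor_pairs` and `top_k` are hypothetical names, and `top_k` stands in for the 500 mentioned above. The sketch records the start position of each pair so that a pair made of two common words is not counted twice.

```python
from collections import Counter

def neighbor_pairs(words, top_k):
    """Count 2-word groups that contain at least one of the top_k words."""
    top_words = {w for w, _ in Counter(words).most_common(top_k)}
    # Remember where each qualifying pair starts; using a set means a
    # pair of two common words is recorded only once.
    starts = set()
    for i, word in enumerate(words):
        if word in top_words:
            if i > 0:
                starts.add(i - 1)      # pair: previous word + this word
            if i + 1 < len(words):
                starts.add(i)          # pair: this word + next word
    return Counter(f"{words[s]} {words[s + 1]}" for s in starts)

text_words = ("Johny went to the store after which Johny went to buy gas "
              "and at the end of the day Johny went home").split()

backup_counts = neighbor_pairs(text_words, 3)
print(backup_counts.most_common(2))
```

This still walks the whole word list, so it is unlikely to be cheaper than counting every pair directly. It mainly acts as a filter that discards groups made only of rare words.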


