Cantfigureitout Posted December 16, 2007

Hi everyone,

I'm developing a little tool that analyzes content from a wide variety of sources; in other words, a lot of data. I need to find the most common paired and larger word groups.

For example, take: "Johny went to the store after which Johny went to buy gas and at the end of the day Johny went home!"

What I'm trying to achieve is to scan for groups of 2 words, groups of 3 words, etc. So in this case the result would be:

Most common groups of 2 words:
#1: "Johny went" (found 3 times)
#2: "went to" (found 2 times)
etc.

And the same thing for groups of 3 words, and possibly groups of 4 depending on how intensive it is.

I can either dump ALL the data into one huge variable containing all the text to be analyzed (around 100,000 words on average at the moment), or (and this might result in better groupings too) split the content whenever a . , ; ? or ! occurs and store the pieces in an array (probably faster, hehe).

Anyway, what do you guys think is the best way to then analyze the contents and count the word groupings?
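(The post doesn't specify a language, so here is a minimal Python sketch of the counting idea: split the text on sentence punctuation as suggested above, then slide a window of n words over each piece and tally the groups with a counter. The function name `top_ngrams` and the result limit are my own choices for illustration.)

```python
import re
from collections import Counter

def top_ngrams(text, n, limit=10):
    """Return the `limit` most common groups of `n` consecutive words.

    The text is split on . , ; ? ! first, so a word group never
    spans a sentence boundary (as suggested in the post).
    """
    counts = Counter()
    for sentence in re.split(r"[.,;?!]", text):
        words = sentence.split()
        # Slide a window of n words over the sentence and tally each group.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(limit)

sample = ("Johny went to the store after which Johny went to buy gas "
          "and at the end of the day Johny went home!")
print(top_ngrams(sample, 2, 3))
# The two most common pairs are ("Johny went", 3) and ("went to", 2).
```

For 100,000 words this stays roughly linear in the text size for each group length, since every window position is just one dictionary increment.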
Cantfigureitout (author) Posted December 16, 2007

By the way, my backup plan is to grab the top 500 most common SINGLE words, run them through the whole 100,000-word text, grab the words before and after each occurrence, and come up with the word groups that way. But that's a backup in case my question above turns out to be too intensive or impossible (then again, nothing is impossible).
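(A sketch of that backup plan, again in Python since the thread names no language: first tally single words, then count only the pairs that involve one of the top words and its immediate neighbour. The function name `neighbor_pairs` and the `top_k` parameter are illustrative choices, not anything from the post.)

```python
import re
from collections import Counter

def neighbor_pairs(text, top_k=500):
    """Backup approach: find the `top_k` most common single words,
    then count only the adjacent pairs that contain one of them.
    Returns a Counter mapping (word, next_word) tuples to counts.
    """
    words = re.findall(r"\w+", text)
    common = {w for w, _ in Counter(words).most_common(top_k)}
    pairs = Counter()
    for i in range(len(words) - 1):
        # Keep a pair only if either member is one of the common words.
        if words[i] in common or words[i + 1] in common:
            pairs[(words[i], words[i + 1])] += 1
    return pairs

sample = ("Johny went to the store after which Johny went to buy gas "
          "and at the end of the day Johny went home!")
print(neighbor_pairs(sample, top_k=3).most_common(2))
```

Note this filters the pair counts rather than reducing the work much: you still scan every adjacent pair once, so the full n-gram count above is unlikely to be meaningfully more expensive.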