moose-en-a-gant

A database of nouns, adverbs, verbs...

I'm curious how I should go about creating this: an array or list of common everyday words, each tagged with an identifier such as noun, pronoun, adverb or verb.

I'm wondering whether I have to scrape grammar or dictionary websites, or whether something like this already exists.

What would be a better approach?


There is not enough information about what you are trying to accomplish to give a useful answer. What do you consider "common everyday" words? How are you planning to create that list? Some words can serve as more than one part of speech:

 

Did you feed the dog?

Did you buy the feed for the horses?

 

We need a post for that sign.

The bank will post that transaction tomorrow.

 

etc., etc.
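One way to cope with words like "feed" and "post" is to let each word map to a list of types rather than a single one. Here is a minimal PHP sketch of that idea. The $lexicon array and the wordTypes() and isAmbiguous() helpers are made up for illustration and are not from any existing library:

```php
<?php
// Hypothetical hand-built lexicon: each word maps to every type it can take.
$lexicon = [
    'feed' => ['verb', 'noun'],
    'post' => ['noun', 'verb'],
    'dog'  => ['noun'],
];

// Return all known types for a word, or an empty array if it is unknown.
function wordTypes(array $lexicon, string $word): array
{
    $key = strtolower($word);
    return $lexicon[$key] ?? [];
}

// A word is ambiguous when it has more than one possible type.
function isAmbiguous(array $lexicon, string $word): bool
{
    return count(wordTypes($lexicon, $word)) > 1;
}

var_dump(isAmbiguous($lexicon, 'Feed')); // bool(true)
var_dump(isAmbiguous($lexicon, 'dog'));  // bool(false)
```

A lookup like this only tells you which types are possible. Picking the right one for a given sentence still needs the surrounding words.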


I see your point. I'm creating a learning application where I can go to a website, select all, copy, and drop the text into an input, and a summary is generated after I have "taught" it, so to speak, by providing samples (a paragraph, plus the words I pulled out of it). I believe the English language has enough structure that you could build pattern recognition to find the summary or main point of an entire webpage. I would then write a cURL script or some other scraper that searches multiple websites for the same topic and builds a collection of summaries, which would ideally be read back to me haha <- gotta hire a narrator

 

I'm wondering about how strings are parsed, e.g. left to right, top to bottom. I probably can't explode the text into an array of words with identifiers and then isolate words and find word frequencies without going left to right through every word.
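For what it's worth, splitting is itself one left-to-right pass over the string, but PHP does that pass for you, and the counting can then work on the resulting array. A rough sketch, assuming words are runs of letters and apostrophes (wordFrequencies() is a made-up name, and the splitting rule is an assumption, not a rule of English):

```php
<?php
// Split text into lowercase words and count how often each one occurs.
function wordFrequencies(string $text): array
{
    // Split on anything that is not a letter or apostrophe; drop empty pieces.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts); // most frequent first, keys (the words) preserved
    return $counts;
}

$counts = wordFrequencies("Did you feed the dog? Did you buy the feed for the horses?");
print_r(array_slice($counts, 0, 4, true));
```

This only handles plain a-z text. Accented or non-English characters would need a Unicode-aware pattern.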



Then there is

 

Time flies like an arrow

Fruit flies like a banana

 

and

 

The weary ploughman plods his homeward way,

The ploughman, weary, plods his homeward way,

His homeward way the weary ploughman plods,

His homeward way the ploughman weary plods,

The weary ploughman homeward plods his way,

The ploughman, weary, homeward plods his way,

His way, the weary ploughman homeward plods,

His way, the ploughman, weary, homeward plods,

The ploughman, homeward, plods his weary way,

His way the ploughman, homeward, weary plods,

His homeward weary way the ploughman plods,

Weary, the ploughman homeward plods his way,

Weary, the ploughman plods his homeward way,

Homeward, his way the weary ploughman plods,

Homeward, his way the ploughman, weary, plods,

Homeward, his weary way, the ploughman plods,

The ploughman, homeward, weary plods his way,

The ploughman, weary, homeward plods his way,

His weary way, the ploughman homeward plods,

His weary way, the homeward ploughman plods,

Homeward the ploughman plods his weary way,

Homeward the weary ploughman plods his way,

The weary ploughman, his way, homeward plods,

The ploughman, weary, his way homeward plods,

The ploughman plods his weary, homeward way,

Weary, the ploughman, his way homeward plods,

Weary, his homeward way the ploughman plods.


I started a project like this a year or so ago.

 

Basically I've written an interface for adding words and their type(s), and added the entries manually. IMHO scraping would be either illegal or just not in the spirit of things.

 

One way I add words is by parsing sample text and identifying the unknown words, which I then work through.
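The unknown-word step could look something like the sketch below. It assumes the known words are available as a plain array. The unknownWords() function and its splitting rule are hypothetical, not the actual code behind the project described here:

```php
<?php
// Return the distinct words in $text that are missing from $knownWords.
function unknownWords(string $text, array $knownWords): array
{
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Flip the known list so each lookup is a cheap key check.
    $known = array_flip(array_map('strtolower', $knownWords));
    $unknown = [];
    foreach (array_unique($words) as $word) {
        if (!isset($known[$word])) {
            $unknown[] = $word;
        }
    }
    return $unknown;
}

$known = ['the', 'dog', 'did', 'you', 'feed'];
print_r(unknownWords('Did you feed the horses?', $known)); // only "horses" is not in $known
```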

 

 

As for the DB, there are two tables (actually three): one for the words (id, word, word2, added, user, status) and another for the lists (id, word_id, type, added, user, status). The third is for training content.

 

The word itself goes into the word column of the words table, and each word type selected for that word then gets its own row in the lists table.

* I have separate entries for plurals, etc. (even though it can recognise plurals, prefixes, etc.)

* word2 is an ordered version of the word for quicker anagram solving.
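To illustrate the two tables and the word2 column, here is a small in-memory sketch. It assumes "ordered version" means the word's letters sorted alphabetically, which is a guess. The arrays stand in for the real tables, only some columns are shown, and all function names are made up:

```php
<?php
// Build the "word2" value: the word's letters in alphabetical order (assumed meaning).
function sortedLetters(string $word): string
{
    $letters = str_split(strtolower($word));
    sort($letters);
    return implode('', $letters);
}

// In-memory stand-ins for the words and lists tables.
$words = [
    ['id' => 1, 'word' => 'post', 'word2' => sortedLetters('post')],
    ['id' => 2, 'word' => 'stop', 'word2' => sortedLetters('stop')],
    ['id' => 3, 'word' => 'feed', 'word2' => sortedLetters('feed')],
];
$lists = [
    ['id' => 1, 'word_id' => 1, 'type' => 'noun'],
    ['id' => 2, 'word_id' => 1, 'type' => 'verb'],
    ['id' => 3, 'word_id' => 2, 'type' => 'verb'],
    ['id' => 4, 'word_id' => 3, 'type' => 'noun'],
    ['id' => 5, 'word_id' => 3, 'type' => 'verb'],
];

// Anagrams are the rows sharing the same word2 value as the query.
function anagramsOf(array $words, string $query): array
{
    $key = sortedLetters($query);
    $matches = [];
    foreach ($words as $row) {
        if ($row['word2'] === $key) {
            $matches[] = $row['word'];
        }
    }
    return $matches;
}

// Types for a word are the lists rows whose word_id matches.
function typesFor(array $lists, int $wordId): array
{
    $types = [];
    foreach ($lists as $row) {
        if ($row['word_id'] === $wordId) {
            $types[] = $row['type'];
        }
    }
    return $types;
}

print_r(anagramsOf($words, 'pots'));
print_r(typesFor($lists, 1));
```

In a real database the same lookups would be a WHERE on word2 and a join on word_id. The point of storing word2 is that finding anagrams becomes a plain equality match.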

 

Here's a list of the word types I'm using so far (there is another list which groups these):

$wordtypes = array(
    "adjective", "adjective_continent", "adjective_personality_negative", "adjective_personality_positive",
    "adverb", "adverb_completeness", "adverb_frequency", "adverb_how", "adverb_manner", "adverb_place",
    "adverb_purpose", "adverb_time", "adverb_time_frequency", "adverb_time_frequency_indef",
    "adverb_time_point", "adverb_time_relationship", "adverb_what_extent", "adverb_when", "adverb_where",
    "contraction_informal", "interjection",
    "noun", "noun_continent", "noun_country", "noun_fruit", "noun_names_boys_eng", "noun_names_girls_eng",
    "noun_names_unisex_eng", "noun_surname_eng",
    "prefix", "preposition", "pronoun", "question_words", "stopword",
    "suffix_derivational", "suffix_inflectional",
    "verb", "verb_regular",
    "plural", "noun_phrase", "verb_participle", "verb_transitive", "verb_intransitive",
    "conjunction", "definite_article", "indefinite_article", "nominative"
);

I can't currently tell you how many words I have because I've re-installed my OS recently and haven't got around to re-installing my word DB yet. But I believe I have around 10,000 words. It may not seem like many (and it's not) but it is enough to parse most children's books which was my reason for doing this.
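The grouping list itself isn't shown in this post. One guess at how the types in $wordtypes could be grouped is by the part of each name before the first underscore. The sketch below assumes that convention and uses a shortened copy of the list. groupByBaseType() is a hypothetical helper:

```php
<?php
// Shortened copy of the type list, for illustration only.
$wordtypes = ["adjective", "adverb", "adverb_time", "adverb_place", "noun", "noun_fruit", "verb", "verb_regular"];

// Group each type under the part of its name before the first underscore.
function groupByBaseType(array $types): array
{
    $groups = [];
    foreach ($types as $type) {
        $base = explode('_', $type, 2)[0];
        $groups[$base][] = $type;
    }
    return $groups;
}

print_r(groupByBaseType($wordtypes));
```

Names such as question_words or definite_article don't follow the base_subtype pattern, so a hand-maintained grouping would still be needed for those.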

 

One helpful (even though baffling at first) book I have is:

http://www.amazon.co.uk/Finite-state-Language-Processing-Speech-Communication/dp/0262181827/ref=sr_1_1?&keywords=finite-state+language+processing

but a great one for the shelf is:

http://www.amazon.co.uk/Structure-Magic-About-Language-Therapy/dp/0831400447/ref=sr_1_1?keywords=the+structure+of+magic

The latter book has nothing to do with computer programming; it is about NLP in the Neuro-Linguistic Programming sense.

 

Both will help with understanding the structure of sentences. May I also point out that English may be one of the harder languages because of all its beautiful ambiguities.
