Jump to content

word list algorithm


fooDigi

Recommended Posts

i have been racking me brain trying to find an effective way of doing this...

given a word list (20,000+) and a list of characters (about 5-10), how can i find all the words in the word list that use those characters only once...

 

so, example:

character_list = mlfrau

 

it would find:

stay,flu,arm,far,ram,fur,farm,maul,mural,armful

 

as long as they exist in the word list.

 

thanks for any help!

Link to comment
Share on other sites

That's a tough one.  Is your word list likely to change?  Is the list of characters likely to change?  If one is going to be mostly static, then we can choose an algorithm that takes advantage of that.

 

Assuming the word list of mostly static (that is, changes are infrequent) we can re-arrange it into a structure that aids the search.  A trie of some sort may be good.

 

Perhaps a reverse trie, where the nodes tell you what letter isn't in a word.  "All letters in the list must be present" is equivalent to "No letter in the list may not be present", so reversing the logic like that may open up better algorithms.

Link to comment
Share on other sites

First, build a sort_letters() function that will take a string and rearrange the letters so they are ordered in alphabetical order.

Now lets say that $words holds your 20K words and $letters holds your letters (the "char_list" as you defined)...

 

<?php

$letters = sort_letters($letters);

$results = array();
foreach($words as $word)
  if(strpos($letters, sort_letters($word)) !== FALSE)
      $results[] = $word;

//$results holds all of your words

?>

 

Now obviously your words are not stored in an array, but it's easy to apply the idea above on words that are in a database or something like that. And if you do have them in a database, I suggest adding a "sorted" column to the words table containing the sort_letters() version of the word, and compare that in strpos(). This way your runtime will get much faster.

 

 

Orio.

Link to comment
Share on other sites

I think I misunderstood the question.. there is a flaw in Orio's suggestion though.  The word "arm" will not match, because it does not contain the letters f and l.

 

Instead you could use a regexp match in place of the strpos.

 

 preg_match('|^a?f?l?m?r?u?$|', sort_letters($word))

 

That expression will allow missing letters in the middle.

 

I'm sure there must be a more efficient algorithm though .. and I'm sure some sort of modified trie will work here.

Link to comment
Share on other sites

i have been racking me brain trying to find an effective way of doing this...

given a word list (20,000+) and a list of characters (about 5-10), how can i find all the words in the word list that use those characters only once...

 

so, example:

character_list = mlfrau

 

it would find:

stay,flu,arm,far,ram,fur,farm,maul,mural,armful

 

as long as they exist in the word list.

 

thanks for any help!

Hello,

Not sure how this is math-related, but okay. How does it find the word "stay"? Also, can you repeat letters?

 

Ken

Link to comment
Share on other sites

thanks guys!

after this post i quickly realized that this probably wouldn't be the correct topic to post under. here is the link to my next post, which was the same, and here is the solution, posted by sasa...

 

Ken2k7, i assumed there could possibly be some math algorithm to parse character positions in multiple words, blah, blah. i had a regular expressions book next to me and didn't consult it ;)

 

<?php 
$char = 'mlfraum';
$word_list = 'frr,print,mam,the,ra,fratzl,rai,far,maul,maal,mail,ram,ramm,farm,farrm';
$a = explode(',',$word_list);
$char = str_split($char);
sort($char);
$char ='/^'.implode('?',$char).'?$/';

foreach ($a as $word){
$word1 = str_split($word);
sort($word1);
$word1 =implode('',$word1);
if (preg_match($char, $word1)) echo $word;
}
?>

Link to comment
Share on other sites

Ok, I've sorted it out.  Here's the fast trie algorithm (pseudocode only .. if you like it, I might be convinced to write real code)

 

1.  Each trie node is labelled with a single letter

2.  The root node is the labelled with nothing, and represents the start of a word

3.  Each node contains a list of words containing all letters up to that point

4.  There are up to 26 children of each node (assuming 26 letters).

5.  Nodes lower in the tree will have fewer children, because all children must have letters greater in the ordering than their parent.

6.  Any path from the root to a leaf passes letters in sorted order.

 

Example.

 

Word list:

the

ten

cat

 

Trie:

 

     root
a     e
c   h   n
t   t   t
|   |   |
cat the ten

 

To find matching words, you start at the top.  Follow all children which match any letter in your letter list.  If your letter list was "aent", you would follow the left path for only one step and then fail there.  But you could follow the right-most path all the way to the bottom and find the word "ten".

 

If your word list contained "he" and your letter list also contained "e" and "h", you would find "he" after following the "e" and "h" nodes.

 

This algorithm (providing the trie is pre-generated) will run much faster than the linear search method.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.