fooDigi Posted February 20, 2008 Share Posted February 20, 2008 i have been racking me brain trying to find an effective way of doing this... given a word list (20,000+) and a list of characters (about 5-10), how can i find all the words in the word list that use those characters only once... so, example: character_list = mlfrau it would find: stay,flu,arm,far,ram,fur,farm,maul,mural,armful as long as they exist in the word list. thanks for any help! Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/ Share on other sites More sharing options...
btherl Posted February 20, 2008 Share Posted February 20, 2008 That's a tough one. Is your word list likely to change? Is the list of characters likely to change? If one is going to be mostly static, then we can choose an algorithm that takes advantage of that. Assuming the word list of mostly static (that is, changes are infrequent) we can re-arrange it into a structure that aids the search. A trie of some sort may be good. Perhaps a reverse trie, where the nodes tell you what letter isn't in a word. "All letters in the list must be present" is equivalent to "No letter in the list may not be present", so reversing the logic like that may open up better algorithms. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-471394 Share on other sites More sharing options...
Orio Posted February 20, 2008 Share Posted February 20, 2008 First, build a sort_letters() function that will take a string and rearrange the letters so they are ordered in alphabetical order. Now lets say that $words holds your 20K words and $letters holds your letters (the "char_list" as you defined)... <?php $letters = sort_letters($letters); $results = array(); foreach($words as $word) if(strpos($letters, sort_letters($word)) !== FALSE) $results[] = $word; //$results holds all of your words ?> Now obviously your words are not stored in an array, but it's easy to apply the idea above on words that are in a database or something like that. And if you do have them in a database, I suggest adding a "sorted" column to the words table containing the sort_letters() version of the word, and compare that in strpos(). This way your runtime will get much faster. Orio. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-471851 Share on other sites More sharing options...
btherl Posted February 20, 2008 Share Posted February 20, 2008 I think I misunderstood the question.. there is a flaw in Orio's suggestion though. The word "arm" will not match, because it does not contain the letters f and l. Instead you could use a regexp match in place of the strpos. preg_match('|^a?f?l?m?r?u?$|', sort_letters($word)) That expression will allow missing letters in the middle. I'm sure there must be a more efficient algorithm though .. and I'm sure some sort of modified trie will work here. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-472309 Share on other sites More sharing options...
Ken2k7 Posted February 21, 2008 Share Posted February 21, 2008 i have been racking me brain trying to find an effective way of doing this... given a word list (20,000+) and a list of characters (about 5-10), how can i find all the words in the word list that use those characters only once... so, example: character_list = mlfrau it would find: stay,flu,arm,far,ram,fur,farm,maul,mural,armful as long as they exist in the word list. thanks for any help! Hello, Not sure how this is math-related, but okay. How does it find the word "stay"? Also, can you repeat letters? Ken Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-473094 Share on other sites More sharing options...
fooDigi Posted February 23, 2008 Author Share Posted February 23, 2008 thanks guys! after this post i quickly realized that this probably wouldn't be the correct topic to post under. here is the link to my next post, which was the same, and here is the solution, posted by sasa... Ken2k7, i assumed there could possibly be some math algorithm to parse character positions in multiple words, blah, blah. i had a regular expressions book next to me and didn't consult it <?php $char = 'mlfraum'; $word_list = 'frr,print,mam,the,ra,fratzl,rai,far,maul,maal,mail,ram,ramm,farm,farrm'; $a = explode(',',$word_list); $char = str_split($char); sort($char); $char ='/^'.implode('?',$char).'?$/'; foreach ($a as $word){ $word1 = str_split($word); sort($word1); $word1 =implode('',$word1); if (preg_match($char, $word1)) echo $word; } ?> Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-474388 Share on other sites More sharing options...
Orio Posted February 23, 2008 Share Posted February 23, 2008 Well that's pretty much what I suggested here, without the "flaw" btherl mentioned (it was fixed nicely btw using preg_match()). Orio. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-474402 Share on other sites More sharing options...
fooDigi Posted February 23, 2008 Author Share Posted February 23, 2008 orio, thx i did come back to yours at noticed that the methodology was similar. thanks again for your response, wish i would've looked back at this post earlier. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-474409 Share on other sites More sharing options...
btherl Posted February 25, 2008 Share Posted February 25, 2008 Ok, I've sorted it out. Here's the fast trie algorithm (pseudocode only .. if you like it, I might be convinced to write real code) 1. Each trie node is labelled with a single letter 2. The root node is the labelled with nothing, and represents the start of a word 3. Each node contains a list of words containing all letters up to that point 4. There are up to 26 children of each node (assuming 26 letters). 5. Nodes lower in the tree will have fewer children, because all children must have letters greater in the ordering than their parent. 6. Any path from the root to a leaf passes letters in sorted order. Example. Word list: the ten cat Trie: root a e c h n t t t | | | cat the ten To find matching words, you start at the top. Follow all children which match any letter in your letter list. If your letter list was "aent", you would follow the left path for only one step and then fail there. But you could follow the right-most path all the way to the bottom and find the word "ten". If your word list contained "he" and your letter list also contained "e" and "h", you would find "he" after following the "e" and "h" nodes. This algorithm (providing the trie is pre-generated) will run much faster than the linear search method. Link to comment https://forums.phpfreaks.com/topic/92041-word-list-algorithm/#findComment-475454 Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.