Jump to content

most eficient way to do a dictionary search


funkybeat

Recommended Posts

Hi phpFreaks!

 

To improve my php skills I do some private coding for my own.

I did some small php scripts in the past, however never worked with the "search/DB" topic.

 

This is where someone may help me a bit to give me a good start for my next project:

 

"A scrabble search engine based on a dictionary with php & mysql"

I'll try to do this as most efficient (fast) as possible.

 

Requirements:

 

> "field to enter random 1-14 letters" to search for words in the DB

> "Dictionary in Database" or in file but I dont think that may be faster...

 

My plan:

 

> Put every single word into a database table called "words" (EDIT: maybe I should add a second row and save the "word length" of each word as well)

> split up the 1-14 letters into a letter array

> build a loop which groups the 1-14 letterArray together in all possible combinations:

> 1st run: two-party groups -> check with "like" if its in the database for every single entry and put the result in a resultArray (EDIT: only check the words with the "word length" = 2)

> 2nd run: three-party groups -> check with "like" if its in the database for every single entry and attach the result to the resultArray (EDIT: only check the words with the "word length" = 3)

.....>14th run...

> output resultArray

 

 

I'm not looking for code parts, I'm just looking for a good concept to start with. Does someone have any recommendations on how I could do it better than the above?

In my eyes there are too many DB search requests and maybe there is a better way to do it?

 

Cheers

Mike

Edited by funkybeat
Link to comment
Share on other sites

Assuming you're going to save each form of a word as a new word in itself, which would be easiest, instead of saving the root form and then having another table for the other forms/suffixes.

 

You don't need to use like at all, and this challenge becomes rather simple: Add an index on the word, preferably a unique, and you're pretty much set. Seeing as you want words of that specific length, and not words containing a certain letter combination.

 

Additionally, for the low number of words in the above database, and the high number of combinations possible, I'd probably just retrieve all of them first. Stuff them into an array as keys, and then do an check for isset () in the array that generates/checks the letter combinations. Always query with and for the least amount of data, using the minimum number of queries possible.

If there had only been a few words to check against the database, I'd build the query using IN() to retrieved all of the legit words.

 

PS: Moved this to the "Application design" section, as this is about application design and not actual code assistance.

Edited by Christian F.
Link to comment
Share on other sites

> build a loop which groups the 1-14 letterArray together in all possible combinations:

 

Know that such a task is not really practical, especially with an upper limit of 14 letters.  With 14 letters you're looking at 14-factorial combinations (roughly 87 billion). Aside from the time it would take to generate that many combinations, in order to store all of them if you wanted to do a query would cost you 1.1 TB of space.

 

Querying all the dictionary words and checking against them as you generate each combination would cut down on the memory requirement, but you'd still be looking at a fairly substantial time investment just to generate the combinations.

 

Something that would probably help to find possible words is to first check for any letter combinations that create common letter groupings, such as 'er', 'ing', 'tion', 'ed', etc and then search for dictionary words containing those combinations.  Then you can filter the dictionary words based on whether or not all the letters for the word are present in the available letters list.

 

 

Link to comment
Share on other sites

Hi Chris & Kicken,

 

thx for your feedback! In this case I'd go for a more array based solution instead of bashing the server with too many SQL requests.

The dictionary is around 7MB so it should be doable to put everything into the array and check it with isset().

 

Cheers any many thanks for your help!

Mike

Link to comment
Share on other sites

Hi xylex,

 

so you would recommend MongoDB ? The only reason for me to use MySql is because I know how to use it and it's already running on my server.

 

However I'm more than happy to try something new.

I just went through their website and it seems I'll need some more time to understand all the details of mapreduce & sharding (never heard this before...)

 

thanks for throwing this in!

Michael

Link to comment
Share on other sites

Yeah, MongoDB would work for this. Reason I'm suggesting it is that you're basically trying to do that with the earlier suggestions - get everything that looks like it could be a match (Map) - and then filter out just the ones that you want (Reduce). Doing it with just MySQL and PHP means that this process in single threaded/single instance orientated and won't scale well to huge datasets - might not be an issue in this specific use case, but if it's a learning project, might as well go big.

 

Mongo runs a single thread per instance in its map/reduce handling, hence the idea to shard it across multiple instances.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.