Jump to content


Photo

most eficient way to do a dictionary search

dictionary words search

  • Please log in to reply
6 replies to this topic

#1 funkybeat

funkybeat

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 08 February 2013 - 05:30 AM

Hi phpFreaks!

To improve my php skills I do some private coding for my own.
I did some small php scripts in the past, however never worked with the "search/DB" topic.

This is where someone may help me a bit to give me a good start for my next project:

"A scrabble search engine based on a dictionary with php & mysql"
I'll try to do this as most efficient (fast) as possible.

Requirements:

> "field to enter random 1-14 letters" to search for words in the DB
> "Dictionary in Database" or in file but I dont think that may be faster...

My plan:

> Put every single word into a database table called "words" (EDIT: maybe I should add a second row and save the "word length" of each word as well)
> split up the 1-14 letters into a letter array
> build a loop which groups the 1-14 letterArray together in all possible combinations:
> 1st run: two-party groups -> check with "like" if its in the database for every single entry and put the result in a resultArray (EDIT: only check the words with the "word length" = 2)
> 2nd run: three-party groups -> check with "like" if its in the database for every single entry and attach the result to the resultArray (EDIT: only check the words with the "word length" = 3)
.....>14th run...
> output resultArray


I'm not looking for code parts, I'm just looking for a good concept to start with. Does someone have any recommendations on how I could do it better than the above?
In my eyes there are too many DB search requests and maybe there is a better way to do it?

Cheers
Mike

Edited by funkybeat, 08 February 2013 - 05:38 AM.


#2 Christian F.

Christian F.

    Advanced Member

  • Staff Alumni
  • 3,106 posts
  • LocationNorway

Posted 08 February 2013 - 06:03 AM

Assuming you're going to save each form of a word as a new word in itself, which would be easiest, instead of saving the root form and then having another table for the other forms/suffixes.

You don't need to use like at all, and this challenge becomes rather simple: Add an index on the word, preferably a unique, and you're pretty much set. Seeing as you want words of that specific length, and not words containing a certain letter combination.

Additionally, for the low number of words in the above database, and the high number of combinations possible, I'd probably just retrieve all of them first. Stuff them into an array as keys, and then do an check for isset () in the array that generates/checks the letter combinations. Always query with and for the least amount of data, using the minimum number of queries possible.
If there had only been a few words to check against the database, I'd build the query using IN() to retrieved all of the legit words.

PS: Moved this to the "Application design" section, as this is about application design and not actual code assistance.

Edited by Christian F., 08 February 2013 - 06:04 AM.

Keeping it simple.

#3 kicken

kicken

    Wiser? Not exactly.

  • Gurus
  • 2,679 posts
  • LocationBonita, FL

Posted 08 February 2013 - 08:34 AM

> build a loop which groups the 1-14 letterArray together in all possible combinations:


Know that such a task is not really practical, especially with an upper limit of 14 letters.  With 14 letters you're looking at 14-factorial combinations (roughly 87 billion). Aside from the time it would take to generate that many combinations, in order to store all of them if you wanted to do a query would cost you 1.1 TB of space.

Querying all the dictionary words and checking against them as you generate each combination would cut down on the memory requirement, but you'd still be looking at a fairly substantial time investment just to generate the combinations.

Something that would probably help to find possible words is to first check for any letter combinations that create common letter groupings, such as 'er', 'ing', 'tion', 'ed', etc and then search for dictionary words containing those combinations.  Then you can filter the dictionary words based on whether or not all the letters for the word are present in the available letters list.


Recycle your old CD's, don't trash them!
Did I help you out?  Feeling generous? I accept tips via Paypal or Bitcoin @ 14mDxaob8Jgdg52scDbvf3uaeR61tB2yC7

#4 funkybeat

funkybeat

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 11 February 2013 - 05:24 AM

Hi Chris & Kicken,

thx for your feedback! In this case I'd go for a more array based solution instead of bashing the server with too many SQL requests.
The dictionary is around 7MB so it should be doable to put everything into the array and check it with isset().

Cheers any many thanks for your help!
Mike

#5 xylex

xylex

    Advanced Member

  • Members
  • PipPipPip
  • 292 posts

Posted 12 February 2013 - 03:07 AM

This sounds like a problem for map-reduce across a few shards. Any reason you're picking MySQL?
The greatest inefficiencies come from solving problems you will never have.  -Rasmus

PHP Development Blog

#6 funkybeat

funkybeat

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 12 February 2013 - 08:17 AM

Hi xylex,

so you would recommend MongoDB ? The only reason for me to use MySql is because I know how to use it and it's already running on my server.

However I'm more than happy to try something new.
I just went through their website and it seems I'll need some more time to understand all the details of mapreduce & sharding (never heard this before...)

thanks for throwing this in!
Michael

#7 xylex

xylex

    Advanced Member

  • Members
  • PipPipPip
  • 292 posts

Posted 12 February 2013 - 04:02 PM

Yeah, MongoDB would work for this. Reason I'm suggesting it is that you're basically trying to do that with the earlier suggestions - get everything that looks like it could be a match (Map) - and then filter out just the ones that you want (Reduce). Doing it with just MySQL and PHP means that this process in single threaded/single instance orientated and won't scale well to huge datasets - might not be an issue in this specific use case, but if it's a learning project, might as well go big.

Mongo runs a single thread per instance in its map/reduce handling, hence the idea to shard it across multiple instances.
The greatest inefficiencies come from solving problems you will never have.  -Rasmus

PHP Development Blog




0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users

Cheap Linux VPS from $5
SSD Storage, 30 day Guarantee
1 TB of BW, 100% Network Uptime

AlphaBit.com