
Anyone built a content-based recommendation engine?


ohdang888

I'm about to start working on a content-based recommendation engine to find similar entries in my large database of news articles.

 

Do you think the best way to go about this is a simple MySQL MATCH ... AGAINST query? Or are there better approaches?

 

btw my database has about 100,000 entries at the moment

 


Do you think the best way to go about this is a simple MySQL MATCH ... AGAINST query?

Absolutely not.

 

If you want to use open source technology, have a look at Sphinx or Lucene. They are full-text search engines. Just Google them or search the forum; I've posted the links many times.



Thanks man. I'll look into it.

 

I mean, just to make sure before I start: the best way (that's feasible without a huge R&D budget) to find relevant articles would be a full-text search engine, right?

 

Thanks,


Yes. Related articles are best done with a token-based query. When you submit an article, run a script that counts word occurrences and stores the most common words against that article. For example, if I write an article on golf, then the words 'golf', 'tee' and 'fairway' are likely to have a high number of occurrences. These are your tokens. If I submit another article on golf, that article will relate to the same tokens if it has matching high word occurrences.
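A rough sketch of that word-counting step in Python, in case it helps. The function name `top_tokens` and the small stop-word list are my own assumptions, not something from the post above; a real stop-word list would be much longer, and without one the most common "tokens" would be words like "the" and "is".

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "on", "in", "to",
              "i", "this", "are", "it", "very"}


def top_tokens(text, n=3):
    """Return the n most frequent non-stop-words in text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


article_1 = ("This is the body containing the word golf. "
             "Golf is great. I love playing golf.")
article_2 = ("Golf courses around the world. The best fairways. "
             "The fairways are very good on this golf course.")

print(top_tokens(article_1, n=1))  # ['golf']
print(top_tokens(article_2, n=2))
```

Both sample articles end up with 'golf' among their top tokens, which is what links them as related.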

 

So your database:

 

articles
=====
articleId, title, body

1, Golf Article 1, This is the body containing the word golf. Golf is great. I love playing golf.......
2, Golf Article 2, Golf courses around the world. The best fairways. The fairways are very good on this golf course......

 

tokens
=====
tokenId, token

1, golf
2, fairways

 

tokenToArticle
==========
id, tokenId, articleId

1, 1, 1
2, 1, 2
3, 2, 2

 

So you can see that tokenId 1 (golf) is related to both articles. Use a full-text engine such as Sphinx to create an index from your database. It will perform very fast searches to pull out your related articles, and it can also return results for a text search box on the website. Using MySQL to search text is very slow and returns poor results.
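To make the table layout concrete, here is a minimal sketch of the three tables and a "related articles" lookup by shared tokens. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in so it runs anywhere; the thread is about MySQL and Sphinx, and this is not how Sphinx itself would be queried. The function name `related_articles` is hypothetical, and the article bodies are shortened.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (articleId INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE tokens (tokenId INTEGER PRIMARY KEY, token TEXT UNIQUE);
CREATE TABLE tokenToArticle (id INTEGER PRIMARY KEY, tokenId INTEGER, articleId INTEGER);

INSERT INTO articles VALUES (1, 'Golf Article 1', 'This is the body containing the word golf.');
INSERT INTO articles VALUES (2, 'Golf Article 2', 'Golf courses around the world.');
INSERT INTO tokens VALUES (1, 'golf'), (2, 'fairways');
INSERT INTO tokenToArticle VALUES (1, 1, 1), (2, 1, 2), (3, 2, 2);
""")


def related_articles(conn, article_id):
    """Return (articleId, title, shared_token_count) for articles that
    share at least one token with the given article, best match first."""
    return conn.execute(
        """
        SELECT a.articleId, a.title, COUNT(*) AS shared
        FROM tokenToArticle AS mine
        JOIN tokenToArticle AS other ON other.tokenId = mine.tokenId
        JOIN articles AS a ON a.articleId = other.articleId
        WHERE mine.articleId = ? AND other.articleId != ?
        GROUP BY a.articleId, a.title
        ORDER BY shared DESC
        """,
        (article_id, article_id),
    ).fetchall()


print(related_articles(conn, 1))  # [(2, 'Golf Article 2', 1)]
```

Ordering by the number of shared tokens is one simple way to rank the matches; a full-text engine would do its own relevance scoring instead.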


This thread is more than a year old.
