Meta search engine - Aggregation techniques


BurgerBob

Hi everyone, I wasn't sure where to post my problem but thought I'd try here.

 

I am starting work on a project: implementing a basic working version of a meta search engine that aggregates search results from three different search engines (Google cannot be one of them).

 

I am thinking that PHP/MySQL is the best route to go down. However, I have a question about aggregating the returned results from all three engines (let's say Bing, Yahoo and Yebol).

 

I'm currently reading about several rank aggregation methods for the web, mainly Borda's positional method, the footrule/scaled footrule and Markov chain methods. Does anyone have any experience with these? Borda's count seems the most straightforward. Can anyone give a general outline of how one would implement it in a meta search engine?

 

Thanks

 

Link to comment
Share on other sites

To be honest, I'm not familiar with any of those methods.

 

I take it you are fetching live results and not saving any data into a database.

 

I would pull the results into an array and show them from that.

Once they are in the array you can sort, filter or separate the results and display them however you like.
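A rough sketch of the "pull everything into an array" idea. It assumes each engine's results have already been fetched into rows with `url` and `title` keys; the fetching itself, those key names and the helper names `normalizeUrl` and `mergeResults` are all made up for illustration. It merges the lists and collapses duplicate URLs so the combined array can be sorted or filtered afterwards.

```php
<?php
// Hypothetical helper: reduce a URL to a comparable form so the same page
// returned by two engines is recognised as one result.
function normalizeUrl(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'])) {
        return strtolower(trim($url)); // not a full URL, compare as plain text
    }
    $host  = preg_replace('/^www\./', '', strtolower($parts['host']));
    $path  = rtrim($parts['path'] ?? '', '/');
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $host . $path . $query;
}

// Merge per-engine result lists into one array keyed by normalised URL.
// Each merged row remembers which engines returned it and at what rank.
function mergeResults(array $resultsByEngine): array
{
    $merged = [];
    foreach ($resultsByEngine as $engine => $rows) {
        foreach (array_values($rows) as $i => $row) {
            $key = normalizeUrl($row['url']);
            if (!isset($merged[$key])) {
                $merged[$key] = ['url' => $row['url'], 'title' => $row['title'], 'ranks' => []];
            }
            // Keep the best (lowest) rank if an engine lists the same page twice.
            $rank = $i + 1;
            if (!isset($merged[$key]['ranks'][$engine]) || $rank < $merged[$key]['ranks'][$engine]) {
                $merged[$key]['ranks'][$engine] = $rank;
            }
        }
    }
    return $merged;
}
```

Keeping the per-engine ranks around is a design choice: it leaves the door open for whichever aggregation method gets applied later.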

 

I suppose you could generate XML files and display that data to users, so you don't have to keep connecting to the search engines as often.
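One way the caching idea might look, as a sketch only. Note that it stores JSON rather than the XML mentioned above, simply because that takes fewer lines in PHP; the function names, the cache directory and the time-to-live parameter are assumptions.

```php
<?php
// Build the cache file name for a query (case and outer whitespace ignored).
function cacheFile(string $dir, string $query): string
{
    return $dir . '/' . md5(strtolower(trim($query))) . '.json';
}

// Return cached results if a fresh enough file exists, otherwise null.
function cacheGet(string $dir, string $query, int $ttlSeconds): ?array
{
    $file = cacheFile($dir, $query);
    if (!is_file($file) || time() - filemtime($file) > $ttlSeconds) {
        return null; // missing or stale: caller should query the engines again
    }
    $data = json_decode(file_get_contents($file), true);
    return is_array($data) ? $data : null;
}

// Store results for a query, locking the file while writing.
function cachePut(string $dir, string $query, array $results): void
{
    file_put_contents(cacheFile($dir, $query), json_encode($results), LOCK_EX);
}
```

The caller would try `cacheGet()` first and only contact the search engines when it returns null.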

 

It could almost be easier to create your very own search engine.

There are many websites that already do as you plan.

Link to comment
Share on other sites

Hey, thanks for the reply QuickOldCar!

 

It's actually a college project - to implement a working meta search engine.

 

Part of the specification is to have a query preprocessing module (tokenisation, stopword removal and stemming) and complex search ability: boolean search (AND, OR, NOT) and keyword search.
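For what it's worth, the preprocessing steps could be sketched roughly like this. The stopword list is a tiny illustrative one, and the "stemmer" is a crude suffix stripper standing in for a real algorithm such as Porter's (PHP has no built-in stemmer); all function names are made up. Since "and" is in the stopword list, boolean operators would need to be pulled out before this step.

```php
<?php
// Tokenisation: split a raw query into lowercase word tokens.
function tokenize(string $query): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

// Stopword removal: drop tokens found in a stopword list.
// The default list is only an example, not a standard one.
function removeStopwords(
    array $tokens,
    array $stopwords = ['a', 'an', 'and', 'the', 'of', 'in', 'to', 'is', 'for']
): array {
    return array_values(array_diff($tokens, $stopwords));
}

// Stemming placeholder: strips a few common suffixes.
// This is NOT the Porter algorithm, just enough to show where stemming fits.
function crudeStem(string $word): string
{
    foreach (['ing', 'ed', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Full pipeline: tokenise, remove stopwords, then stem each remaining token.
function preprocessQuery(string $query): array
{
    return array_map('crudeStem', removeStopwords(tokenize($query)));
}
```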

 

The important part, however (of any meta search engine), is the aggregation method. Let's say we have three 'natural' engines and we save the top 100 query results from each engine in a database. How do we go about aggregating the results into a single list before displaying them back to the user? How do we know if a document from one engine is more relevant (to the user) than a document from another engine?

 

The three techniques mentioned in my previous post seem to be the most commonplace.
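As a starting point, Borda's count could be sketched like this. It assumes each engine's list has already been reduced to document IDs (for example normalised URLs) with duplicates removed, and that a document an engine did not return simply scores nothing from that engine; other variants share leftover points among the missing documents instead. The function name and input shape are made up for illustration.

```php
<?php
// Borda count over several ranked lists of document IDs.
// $rankings: engine name => list of document IDs, best first.
// Returns document ID => total score, highest score first.
function bordaCount(array $rankings): array
{
    // Points are based on the longest list, so rank 1 always earns the most.
    $maxLength = 0;
    foreach ($rankings as $list) {
        $maxLength = max($maxLength, count($list));
    }

    $scores = [];
    foreach ($rankings as $list) {
        foreach (array_values($list) as $position => $docId) {
            // Rank 1 gets $maxLength points, rank 2 gets $maxLength - 1, and so on.
            // Assumption: documents missing from a list get 0 from that engine.
            $scores[$docId] = ($scores[$docId] ?? 0) + ($maxLength - $position);
        }
    }

    // Sort by score descending; break ties alphabetically for a repeatable order.
    uksort($scores, function ($a, $b) use ($scores) {
        return [$scores[$b], $a] <=> [$scores[$a], $b];
    });
    return $scores;
}
```

With lists `[a, b, c]`, `[b, a, d]` and `[b, c]`, document `b` collects 2 + 3 + 3 = 8 points and ends up first, ahead of `a` with 3 + 2 = 5.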

 

Most of this stuff is new to me (coding languages etc.), so I'm mostly learning as I go. I'm just trying to get a heads up on this particular issue as it's one of the more important elements of the project.

Link to comment
Share on other sites

I myself use boolean mode with MATCH ... AGAINST full-text search on MyISAM tables.

Before MySQL 5.6, full-text search requires MyISAM tables (InnoDB supports FULLTEXT indexes from 5.6 onwards). You should also create indexes on any columns used in your WHERE, AND or OR conditions.

Be aware that on insert a MyISAM table is locked even for reads; this is where caching the results and pages helps greatly.

 

The manual can explain much better than I can.

http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html

 

If you are saving to a database there is no need to aggregate them as you describe.

The results can be displayed back in any order you like using ordinary MySQL SELECT statements.

 

I use multiple MySQL queries with if/elseif, depending on which type of advanced search the user selects:

starts with characters, contains characters, at least one word, exact phrase, one word, exclude word

 

Let's break down what we want here.

tokenisation:

Basically this means splitting the query (or the content) into individual words or phrases.

Do you need the user's input to change as a sort of suggestion? I'll tell you right now this will not be easy.

But sure, you can break up the results in your database into the word before, the word itself and the word after, and display them as suggestions or as groups.

 

My boolean searches look like this.

I also break it down by section (title, description, keywords) so you can search just those.

$post_status is a dynamic WHERE clause for the admin view, where I can select all, publish, pending or banned; users just see published results.

The others are self-explanatory. LIMIT is used for my pagination along with the count.

if ($search == "Date") {
$result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_date) AGAINST ('\"$search_words\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_date) AGAINST ('\"$search_words\"' IN BOOLEAN MODE)"); 
} elseif ($search == "ID") {
$result = mysql_query("SELECT * FROM posts $post_status AND ID LIKE '".$search_words."%' ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND ID LIKE '".$search_words."%'");
} elseif ($search == "url_begins_characters") {
    $result = mysql_query("SELECT * FROM posts $post_status AND post_title LIKE '".$search_words."%' ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND post_title LIKE '".$search_words."%'");
} elseif ($search == "url_contains_characters") {
    $result = mysql_query("SELECT * FROM posts $post_status AND post_title LIKE '%"."$search_words"."%' ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND post_title LIKE '%"."$search_words"."%'");
} elseif ($search == "feed_single_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND link_rss LIKE '%"."$search_words"."%' ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND link_rss LIKE '%"."$search_words"."%'");
} elseif ($search == "one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "exact_words") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('\"$search_words\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('\"$search_words\"' IN BOOLEAN MODE)");   
} elseif ($search == "least_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "exclude_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_title,post_content) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE)");
} elseif ($search == "title_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "title_exact_words") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('\"$search_words\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('\"$search_words\"' IN BOOLEAN MODE)");   
} elseif ($search == "title_least_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "title_exclude_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (title_2) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE)");
} elseif ($search == "description_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "description_exact_words") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('\"$search_words\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('\"$search_words\"' IN BOOLEAN MODE)");   
} elseif ($search == "description_least_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "description_exclude_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_description) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE)");
} elseif ($search == "keyword_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "keyword_exact_words") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('\"$search_words\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('\"$search_words\"' IN BOOLEAN MODE)");   
} elseif ($search == "keyword_least_one_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('$search_words' IN BOOLEAN MODE)");
} elseif ($search == "keyword_exclude_word") {
    $result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
    $total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_keywords) AGAINST ('+$search_words -$search_words' IN BOOLEAN MODE)");
} else {

//if anything goes wrong above or nothing selected, this will be used as the default query instead

//just show last 10 by id main page
if ($url == "http://get.blogdns.com/dynaindex/index.php" OR $url == "http://dynaindex.com/index.php") {
$result = @mysql_query("SELECT * FROM posts ORDER BY ID DESC LIMIT 0,10");
//$result = mysql_query("SELECT * FROM posts $post_status ORDER BY ID DESC LIMIT 0,10");
$total_count = $result;
} else {
//todays results new and updated
$result = mysql_query("SELECT * FROM posts $post_status AND post_date LIKE '".$today_date."%' ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND post_date LIKE '".$today_date."%'");
}

/*
$result = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_date) AGAINST ('\"$today_date\"' IN BOOLEAN MODE) ORDER BY $display $order LIMIT $startrow,$posts_per_page" );
$total_count = mysql_query("SELECT * FROM posts $post_status AND MATCH (post_date) AGAINST ('\"$today_date\"' IN BOOLEAN MODE)"); 
*/

}
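The full-text branches above differ only in the column list and in how the AGAINST string is built, so they could be condensed along these lines. This is a sketch, not a drop-in replacement. It returns a WHERE fragment with a `?` placeholder meant for a prepared statement, because `mysql_query()` was removed in PHP 7 and putting `$search_words` straight into the SQL leaves it open to injection. It also assumes the words to exclude arrive as a separate input (`$excluded`, a made-up parameter), since `'+$search_words -$search_words'` requires and forbids the same words and so cannot match anything. The function names are hypothetical.

```php
<?php
// Prefix every whitespace-separated word with a boolean-mode operator ("+" or "-").
function prefixWords(string $words, string $operator): string
{
    $parts = preg_split('/\s+/', trim($words), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map(fn($w) => $operator . $w, $parts));
}

// Returns ['where' => SQL fragment with one "?" placeholder, 'param' => value to bind],
// or null when $search is not a known full-text search type.
function buildFulltextClause(string $search, string $words, string $excluded = ''): ?array
{
    // Column lists copied from the original if/elseif chain.
    $columnsByPrefix = [
        'title_'       => 'title_2',
        'description_' => 'post_description',
        'keyword_'     => 'post_keywords',
        ''             => 'post_title,post_content', // unprefixed search types
    ];
    foreach ($columnsByPrefix as $prefix => $columns) {
        if ($prefix === '' || strpos($search, $prefix) === 0) {
            $style = substr($search, strlen($prefix));
            break;
        }
    }

    switch ($style) {
        case 'one_word':
        case 'least_one_word':
            $against = trim($words);
            break;
        case 'exact_words':
            // Wrap in double quotes for a phrase search; drop stray quotes first.
            $against = '"' . str_replace('"', '', trim($words)) . '"';
            break;
        case 'exclude_word':
            // Assumption: excluded words come from a separate input field.
            $against = trim(prefixWords($words, '+') . ' ' . prefixWords($excluded, '-'));
            break;
        default:
            return null; // caller falls back to the default query
    }
    return ['where' => "MATCH ($columns) AGAINST (? IN BOOLEAN MODE)", 'param' => $against];
}
```

The fragment would then go into a prepared statement (PDO's `prepare()` and `execute()`, for example). The ORDER BY column and direction can't be bound as parameters, so those would need checking against a fixed whitelist.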

 

stopword removal:

http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html

There is a default list already, and you can edit it.

Stopwords are simply ignored in the searches but can still be present in the results.

 

You may also want to do filtering:

You can keep an array, or read from a list, of any bad words or phrases you do not want.

With this information you have choices: do not show the result, or replace the bad word with something else. In essence, filtering.

 

Here's how I filter when adding sites and links to my index.

I have a post_status column in MySQL with three values: publish, pending and banned.

When adding links I have a text file of bad words or phrases. The URL, title, description and keywords are checked against it, and depending on whether there is a match the link gets set to pending or publish.

I also have a banned text file; if a link matches anything in there it will be banned.

 

Of course this takes time and resources when getting many links.

I have an alternate method too: simply don't display the result if it matches the bad-phrases list, or even mask the word with another word using str_ireplace on output.
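The two approaches described above might look something like this. The function names and the `***` mask are assumptions; the status values are the ones from the post. The lists are passed in as arrays here, so reading them from the text files is left to the caller.

```php
<?php
// Decide a post_status from two phrase lists:
// the banned list wins, then the bad-phrase list, otherwise publish.
function classifyLink(array $fields, array $bannedPhrases, array $badPhrases): string
{
    $text = implode(' ', $fields); // url, title, description, keywords
    foreach ($bannedPhrases as $phrase) {
        if ($phrase !== '' && stripos($text, $phrase) !== false) {
            return 'banned';
        }
    }
    foreach ($badPhrases as $phrase) {
        if ($phrase !== '' && stripos($text, $phrase) !== false) {
            return 'pending'; // held back for review rather than published
        }
    }
    return 'publish';
}

// Alternative: mask bad phrases on output instead of hiding the result.
function maskPhrases(string $text, array $badPhrases, string $mask = '***'): string
{
    return str_ireplace($badPhrases, $mask, $text);
}
```

A list file with one phrase per line could be loaded with `file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)`.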

 

AND, OR, NOT and keyword search ability:

+ stands for AND

- stands for NOT

[no operator] implies OR

> <  changes a word's contribution to the relevance value that is assigned to a row

( ) group words into subexpressions

~ lowers the word's relevance

* acts as a wildcard at the end of a word (you will not get very nice results with this, but it is useful for some searches)
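To connect the operators above to what a user types, here is a small sketch that turns a query written with AND, OR and NOT into boolean-mode syntax. It works strictly left to right with no grouping or precedence, and the function name is made up; treat it as a starting point only.

```php
<?php
// Translate a simple "a AND b NOT c" style query into MySQL boolean-mode syntax.
// Words with no operator between them are left bare, which MySQL treats as OR.
function toBooleanMode(string $query): string
{
    $terms = [];   // list of [operator, word]
    $pending = ''; // operator to apply to the next word
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        switch (strtoupper($token)) {
            case 'AND':
                // AND requires the word on both sides of it.
                $last = count($terms) - 1;
                if ($last >= 0 && $terms[$last][0] === '') {
                    $terms[$last][0] = '+';
                }
                $pending = '+';
                break;
            case 'OR':
                $pending = '';
                break;
            case 'NOT':
                $pending = '-';
                break;
            default:
                // Remove characters that act as boolean-mode operators,
                // plus any leading + or - the user typed themselves.
                $word = ltrim(preg_replace('/[<>()~*"@]/', '', $token), '+-');
                if ($word !== '') {
                    $terms[] = [$pending, $word];
                }
                $pending = '';
        }
    }
    return implode(' ', array_map(fn($t) => $t[0] . $t[1], $terms));
}
```

One thing to keep in mind: a boolean-mode search made up only of `-` terms returns nothing, so a query like `NOT google` needs at least one positive word alongside it.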

 

Now, after all I said above...

You may be better off using Cassandra as a database and Python for the searching.

http://cassandra.apache.org/

Cassandra is in use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX, and more companies that have large, active data sets. The largest production cluster has over 100 TB of data in over 150 machines.

 

https://bitbucket.org/mchaput/whoosh/wiki/Home

Link to comment
Share on other sites

QuickOldCar... that's a lot of really good info there, and some food for thought. Thanks!

Your post breaks some of the elements down, makes the project look a little more manageable and has given me more of a steer on things.

Just starting to get stuck into it now. I'm sure I'll have lots of problems/questions along the way for everyone on phpfreaks.

 

Thanks dude :)

Link to comment
Share on other sites
