Full text index relevancy issue

kickstart · September 7, 2012

Hi

I am trying to use a full-text search against a column that holds a string of important search terms. However, the ranking of the results is a bit strange.

For example, a search for "d-link router" against this column brings back a fair few rows, but ranks a row containing TP-Link but not D-Link higher than one that contains D-Link.

This row is ranked 9.4198112487793

Routers-and-Switches TP-Link TL-MR3220 TP-TL-MR3220 ROUTER tlw&tlw tlwAVtlw BUNDLE tlw3Gtlw N-LITE ADSL ROUTER tlw&tlw tlw1YRtlw BULLGUARD tlwAVtlw TP-LINK TP-Link TL-MR3220 3G/3.75G 150Mbps Wireless Lite tlwNtlw Router 6935364051501

while this row is ranked 8.55044555664062

Routers-and-Switches D-Link DSL-2680/UK DL-DSL-2680 D-LINK ADSL ROUTER WIRELESS tlwNtlw tlw150tlw ADSL2+ ROUTER DLINK D-Link DSL-2680 Wireless tlwNtlw tlw150tlw ADSL2+ Modem Router 790069334535

The match statement is as follows:-

SELECT item_keyword_search, MATCH (item_keyword_search) AGAINST ('d-link* router*' )
FROM item_import
WHERE MATCH (item_keyword_search) AGAINST ('d-link* router*' )

Eliminating the * wildcards doesn't change this, nor does splitting the words with a comma.

Any suggestions?

All the best

Keith

The Little Guy · September 7, 2012

Putting a MATCH in the field list is useless unless you want to sort by it.

SELECT item_keyword_search, MATCH (item_keyword_search) AGAINST ('d-link* router*' IN BOOLEAN MODE) AS score
FROM item_import
WHERE MATCH (item_keyword_search) AGAINST ('d-link* router*' IN BOOLEAN MODE)
ORDER BY score DESC

kickstart · September 7, 2012

Hi

I do want to be able to sort them, but it is also useful to see how it is rating matches.

The problem appears to be that MATCH assumes a hyphen separates words. It also ignores words shorter than 4 characters, so D-LINK and TP-LINK are both reduced to LINK and are treated as being the same.
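
One way to work around this without touching the server configuration is to keep the full-text match for filtering and add a plain substring test as the primary sort key. This is only a sketch: the has_exact_term alias is made up for illustration, and the LIKE comparison is case-insensitive only if the column uses a case-insensitive collation (assumed here).

```sql
-- Rows that literally contain "d-link" sort first; the full-text
-- score only breaks ties within each group.
SELECT item_keyword_search,
       MATCH (item_keyword_search) AGAINST ('d-link* router*' IN BOOLEAN MODE) AS score,
       (item_keyword_search LIKE '%d-link%') AS has_exact_term  -- 1 or 0
FROM item_import
WHERE MATCH (item_keyword_search) AGAINST ('d-link* router*' IN BOOLEAN MODE)
ORDER BY has_exact_term DESC, score DESC;
```

The substring test cannot use the FULLTEXT index, but it is only evaluated for rows the MATCH has already selected.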

All the best

Keith

The Little Guy · September 7, 2012

If you have access to the config file, you can lower the minimum word length:

ft_min_word_len = 3
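
A changed ft_min_word_len only takes effect once the server has been restarted and the FULLTEXT index has been rebuilt. A minimal sketch of the follow-up steps, assuming item_import is a MyISAM table:

```sql
-- After the restart, confirm the server picked up the new value.
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Rebuild the indexes so that words of the new minimum length are indexed.
-- Assumes a MyISAM table; QUICK repairs only the index file.
REPAIR TABLE item_import QUICK;
```

Note that with a value of 3 the fragments "d" and "tp" are still below the minimum, so this setting on its own would not separate D-LINK from TP-LINK.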

"If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short (as determined from the ft_min_word_len setting) or a stopword. This occurs because the word is not seen as too short or a stopword, but as a prefix that must be present in the document in the form of a word that begins with the prefix. Suppose that ft_min_word_len=4. Then a search for '+word +the*' will likely return fewer rows than a search for '+word +the'."

Possibility:

Modify a character set file: This requires no recompilation. The true_word_char() macro uses a "character type" table to distinguish letters and numbers from other characters. You can edit the <ctype><map> contents in one of the character set XML files to specify that '-' is a "letter." Then use the given character set for your FULLTEXT indexes.
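
If editing character set files is not an option, a rougher alternative is to index a normalised copy of the keywords in which the hyphen is replaced by an underscore, since the full-text parser counts underscores as word characters. The sketch below is an assumption-laden illustration: the column item_keyword_search_norm and the index name ft_item_keyword_norm are hypothetical, and the search string has to be normalised the same way by the application.

```sql
-- Hypothetical extra column holding the keywords with '-' turned into '_',
-- so "D-Link" is indexed as the single word "d_link".
ALTER TABLE item_import
    ADD COLUMN item_keyword_search_norm TEXT;

UPDATE item_import
SET item_keyword_search_norm = REPLACE(item_keyword_search, '-', '_');

ALTER TABLE item_import
    ADD FULLTEXT INDEX ft_item_keyword_norm (item_keyword_search_norm);

-- Search the normalised column with a normalised search string.
SELECT item_keyword_search,
       MATCH (item_keyword_search_norm) AGAINST ('d_link* router*' IN BOOLEAN MODE) AS score
FROM item_import
WHERE MATCH (item_keyword_search_norm) AGAINST ('d_link* router*' IN BOOLEAN MODE)
ORDER BY score DESC;
```

The trade-off is that every hyphenated string becomes one word, so "Routers-and-Switches" is indexed as "routers_and_switches" and a search for "switches" alone would no longer match in this column. The copy also has to be kept in sync whenever item_keyword_search changes.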

xyph · September 7, 2012

Get dat Sphinx?!

fenway · September 8, 2012

Yeah, FT in MySQL is rather limited -- you'll have to mess with internals to trick it into treating a hyphen as part of a word.

I'd vote for Sphinx, too.
