
fulltext searching on html formatted contents


mattgleeson


Hi

 

I need to create a search engine for my website. It will search page content stored in a MySQL database. The stored content is pre-formatted with HTML, and when I run a fulltext search against it I want to ignore matches of search words that appear inside tags (i.e. HTML tag names and attributes). For example, if someone searches for 'style', I don't want it to match instances of 'style' enclosed in < > such as <div style="abc">.

Row 1 contents might be '<div><p>this years style is ...</p></div>'

Row 2 contents might be '<div style="abc"><p>this years fashion is ...</p></div>'

 

I want the fulltext search to match row 1 but not row 2.

 

Is it possible to do this with fulltext searches? If not, can you give me advice on how to search HTML-formatted text stored in the DB without matching HTML tags (in case the user happens to search for a word that is also an HTML tag or attribute name)?

 

Many thanks in advance

Thanks for your reply. Unfortunately I already store the contents twice in the database and the database has the potential to grow big so storing the content a third time in the DB isn't really an option.

The second copy of the content I store in the DB isn't HTML formatted, but it is split up into different sections and the table uses InnoDB, so I can't do fulltext searching on it.

Well, I don't think it's possible to run a search in MySQL that ignores HTML, and if it is, it will probably be resource-intensive. Once you cut ALL of the HTML out of the average webpage, what remains is usually not that big, so a content-only copy may not be as large as you think. Of course, if that definitely won't meet your needs, you could also make your own "common word" filter and store only the unusual words or phrases in a separate column for text searching, although that breaks up the content and might foil quoted searches (searching for "this and that" won't work if you've removed all of the conjunctions from the text).
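
To give a rough idea of what the content-only copy would look like, here is a minimal sketch in Python using only the standard library. The names `TextOnly` and `strip_html` are made up for illustration, and this is an assumption about how you might generate the column at save time, not something MySQL does for you. The two sample rows are the ones from the question.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tag names and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)


def strip_html(html):
    """Return only the text content of an HTML fragment (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace runs left behind where tags were removed.
    return " ".join(" ".join(parser.parts).split())


row1 = '<div><p>this years style is ...</p></div>'
row2 = '<div style="abc"><p>this years fashion is ...</p></div>'

plain1 = strip_html(row1)
plain2 = strip_html(row2)

print("style" in plain1)  # the word appears in row 1's text
print("style" in plain2)  # in row 2 it only appeared inside a tag
```

The stripped text is what you would store in the extra column and put the fulltext index on. Note that this sketch keeps the contents of `<script>` and `<style>` elements as text, so you would want to skip those if your pages contain them.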

 

That's how I might do it, anyway. Good luck.
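
As for the "common word" filter, something along these lines shows both the idea and the quoted-search problem. It is only a sketch: the `COMMON_WORDS` list is a tiny made-up sample, and `keywords` and `phrase_matches` are hypothetical helpers, not part of any library.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
COMMON_WORDS = {"a", "an", "and", "is", "of", "or", "the", "this", "that", "to"}


def keywords(text):
    """Lowercase the text, split it into words, and drop the common ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in COMMON_WORDS]


def phrase_matches(phrase, stored_keywords):
    """Naive phrase check: the phrase's words must appear consecutively."""
    wanted = re.findall(r"[a-z0-9']+", phrase.lower())
    n = len(wanted)
    return any(stored_keywords[i:i + n] == wanted
               for i in range(len(stored_keywords) - n + 1))


# What would be stored in the separate search column.
stored = keywords("You can have this and that for free")

print(stored)
print(phrase_matches("for free", stored))       # phrase survives the filter
print(phrase_matches("this and that", stored))  # phrase was filtered away
```

The stored column ends up smaller, but a quoted search for "this and that" can no longer match because those words were never stored.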

