Jump to content

[SOLVED] Finding 'needle' in '<div>haystack</div>'


Recommended Posts

Hi

I'm creating a site where people can make a post. Most HTML is allowed and saved to the database. I'm currently trying to create a search function for users to search the content of a post. I basically want the search to only look at what would be displayed when viewing a post, rather than the HTML formatting as well if that makes sense.

 

E.g -

 

The code of a post in the db:

 

<a name='blah' href='http://www.google.com'>Visit google</a>

 

The search for 'name' should not return the above post. but a search for 'Visit' or 'Google' should. I sort of need it to only search what a browser would render which I cant really imagine how to do.

 

Hope that makes sense, any pointers appreciated. Thanks,

Jibster

Link to comment
https://forums.phpfreaks.com/topic/115705-solved-finding-needle-in-haystack/
Share on other sites

There are three ways I can think of, and one I prefer for super fast searching.

 

First, you could search for the keywords, then eliminate the HTML tags, and have PHP search the string again to see if the result remains true.

 

Second, you could create and additional field in you MySQL that holds the content minus HTML codes. PHP would strip the HTML code before it is submitted to the database.

 

Third, and my favorite. PHP strips the html code, then indexes each word individually to a seperate table, which contains an index to all post ID's that have this word in it. So for example you have the word "visit".

 

So in the "QuickIndex" table it would look like this:

ID, WORD, MATCHES

1, "visit", 2,8,10,24

 

Basically this means that when I search for the word visit, all I have to do is search "QuickIndex". That immediately pulls up 4 results, result id 2, result id 8, 10, 24, etc. This system is a little more complicated to implement right, but is extremely quick and efficient. I used this when I made a large program that indexed over a terabyte of generic text. Saved an amazing amount of time in searching. In fact regular searching was impossible. This method made it quick because there are only so many possible matches.

Cheers for that, some good ideas. So for the last example, when you create a new post you would also need to add to the "QuickIndex" table each time? Does this slow down the adding process much?

 

Have you tried your second suggesition of having a different column with just the keywords in, sans the tags? That seems like an easy implementation and also would be pretty quick I'm thinking too.

 

Thanks again!

Adding those words won't add much time at all considering that the insertion process happens so fast it's not noticable. The second suggestion will take forever to search if you get any large amounts of data because MySQL will have to search through every result. The third one limits the possible results and eliminates the need for doing multiple queries if you do the query right. If your not planning on having a lot of data, just do number two.

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.