Categorizing Text in a MySQL database based on multiple keywords


natasha_sharma

Friends,

Wishes for 2016!

Please help me with this request. I need to get it sorted out for my PhD thesis.

I am using MySQL and PHP (XAMPP on localhost, for research).

I have loaded more than half a million corporate news items for India, covering the past 4 years, into a news_content field. Some samples from that field are below:




1) With reference to the earlier letter dated December 18, 2015 in connection with the Scheme of Amalgamation between Digjam Limited and Digjam Textiles Limited ('the Companies') and their respective creditors and shareholders. Digjam Ltd has informed BSE that as directed by the Hon'ble High Court of Gujarat vide their Order dated December 18, 2015, the enclosed newspaper notice is being published in the Newspapers.

2) BEML Ltd has informed BSE regarding a Press Release, titled "BEML bags Export Award".

3) KEI Industries Ltd has informed BSE regarding "Bagging of Orders / Notification of Awards (NOA) valuing Rs. 384.53 Crores (Ex-works) from Power Grid Corporation of India Limited (PGCIL)".

4) With reference to the earlier Press Release dated December 26, 2015 regarding "Srikalahasthi Pipes Limited bags orders of Rs.1047 Crores during December, 2015". Srikalahasthi Pipes Ltd has now informed BSE that the value of the orders received was mentioned as Rs. 1,047 Crores in the caption of the Press Release instead of Rs. 1,053 Crores. Srikalahasthi Pipes Ltd has now submitted to BSE a copy of the Revised Press Release titled "Srikalahasthi Pipes Limited bags orders of Rs. 1053 Crores during December, 2015".

5) Steel Strips Wheels Ltd has informed BSE regarding "SSWL bags exclusive nomination for Mahindra’s Puddling and Vineyard tractor range".

6) Star Delta Transformers Ltd has informed BSE that the Extra Ordinary General Meeting (EGM) of the Company will be held on January 23, 2016.


For my research, I have to dynamically categorize most of the news based on keywords. But it's not as simple as I thought: if I use only one keyword, I will miscategorize a lot of news. For example, if I search for the word "order" in the 6 items above, only items 2, 3 and 4 talk about getting a new "Order". The rest are about either a Court Order or an "Extra Ordinary General" meeting, so there are false positives.

Another issue is that I want to use multiple keywords separated by commas to categorize, so that if any one of the keywords is found, I can categorize the item.

I should also be able to define negative keywords, which must not be present in the news item.

So my head is now spinning, and I can't think through how to sort this out.

How can I implement a solution for this in PHP and MySQL?

Any help, please?

Regards,
Natasha



I agree strongly with Barand. Full-text indexes are made for this sort of problem. They do a few things for you, such as ranking results based on the words in your search, and they add features that let you specify the importance of specific words.
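
As a rough sketch of what the full-text approach could look like for the question above, the statements below add a FULLTEXT index on news_content and run a boolean-mode search. Only the news_content column comes from the original post; the table name corporate_news, the id column, the index name and the particular keywords are assumptions made up for illustration. In boolean mode, terms with no prefix are optional (any one of them is enough for a row to match), and a leading minus sign excludes rows that contain that word or phrase, which maps onto the "any of these keywords, none of those negative keywords" requirement.

```sql
-- Assumption: the news lives in a table named corporate_news with an
-- id primary key; only the news_content column is named in the question.
ALTER TABLE corporate_news
    ADD FULLTEXT INDEX ft_news_content (news_content);

-- Positive keywords (any one is enough): "bags orders", bagging, "orders received"
-- Negative keywords (none may appear):   court, "general meeting"
SELECT id,
       news_content,
       MATCH(news_content) AGAINST(
           '"bags orders" bagging "orders received" -court -"general meeting"'
           IN BOOLEAN MODE
       ) AS score
FROM corporate_news
WHERE MATCH(news_content) AGAINST(
          '"bags orders" bagging "orders received" -court -"general meeting"'
          IN BOOLEAN MODE
      )
ORDER BY score DESC;
```

A few caveats on this sketch: InnoDB tables only support FULLTEXT indexes from MySQL 5.6 onward (older versions need MyISAM), and words shorter than the server's minimum token length, as well as stopwords, are not indexed, so very short keywords may need a configuration change. From PHP, the search string could be assembled from the comma-separated keyword lists and passed to the query as a bound parameter of a prepared statement.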

 

There are also third-party engines built specifically for text search, such as Sphinx or Solr, that are commonly used in the web world even when the original data set lives in MySQL. Sphinx is the easiest to add to the mix given your workstation environment. See http://sphinxsearch.com

In either case, as far as I understand your question, I would recommend full-text indexing or the additional use of Sphinx, unless the entire point of the project is to engineer a solution to these problems yourself. In that case, you should look at the text features and strategies those engines provide to work out what your own solution needs to consider.


Archived

This topic is now archived and is closed to further replies.
