
categorising and linking texts, howto?


surion


To explain what I'm looking for, I'll give you an example.

 

 

A site reads RSS feeds from numerous news sites, and one day it gets the following articles from the feeds:

- 2 articles about that hurricane

- 3 articles about the elections in America

 

What I want the system to do is the following:

Based on the text content of the articles, the system should:

- detect AUTOMATICALLY that the hurricane articles are related, and likewise the articles about the elections

- automatically categorize those articles into, for example, "disasters" and "politics"

 

 

I know something like this already exists, and I want to build something like it myself, but I don't know WHAT algorithms I should look into to get started properly.

 

Does anyone know how I could start on this? Any links to articles? Anyone experienced with this? I'm not asking for code; text that explains how to get started would already be great.

Thanks in advance

 

surion


Not sure really and maybe someone can give better advice. Just my 2 cents.

 

What I can think of is categorizing based on keywords. As soon as an article comes in from the feeds, the script extracts its keywords and finds which ones are used most. If two articles have enough keywords in common, they are probably related. If an article repeatedly contains keywords like "hurricane" or "twister", it belongs in the "disasters" category. Common words like "to", "and", "I", "you", etc. can be excluded from the search/comparison. Basically, it could be a system that uses keywords and the time they occur to determine related articles, and a keyword comparison against a predefined list to determine the category an article belongs to.
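To make the idea a bit more concrete, here is a rough sketch in Python (just for illustration, the same logic ports to PHP easily). The stop-word list, the category keyword lists, and the thresholds are all made up for the example, and it ignores the "time they occur" part entirely, so treat it as a starting point rather than a finished approach.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "for", "i", "in",
              "is", "it", "of", "on", "the", "to", "you"}

# Hypothetical predefined keyword lists, one per category.
CATEGORY_KEYWORDS = {
    "disasters": {"hurricane", "twister", "flood", "storm", "evacuation"},
    "politics": {"election", "elections", "senate", "vote", "candidate"},
}


def words_of(text):
    """Lowercase the text and split it into plain words."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(text, limit=10):
    """Return the most frequent words, skipping stop words and very short words."""
    counts = Counter(w for w in words_of(text)
                     if w not in STOP_WORDS and len(w) > 2)
    return {word for word, _ in counts.most_common(limit)}


def are_related(text_a, text_b, min_shared=3):
    """Two articles count as related if they share enough top keywords."""
    return len(top_keywords(text_a) & top_keywords(text_b)) >= min_shared


def categorize(text):
    """Pick the category whose keywords occur most often, or None if none occur."""
    counts = Counter(words_of(text))
    scores = {category: sum(counts[k] for k in keywords)
              for category, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `are_related(article_a, article_b)` on two article bodies gives a yes/no answer, and `categorize(article)` returns a category name or `None`. The `min_shared` threshold is a guess and would need tuning against real feeds.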

 

Hope this helps a bit.

 

I second that, unless you want to write an algorithm to compete with PageRank. You may be able to do something with a rank-based system using full-text search (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), but it's going to require A LOT of knowledge and time.

Stick to the keyword idea; it's simple and effective.

@GuiltyGear

Very good thinking, I guess. Based on what you say, I'm also thinking about analysing not only single words but also "word groups" (names, for example, are mostly more than one word). It'll be hard, though, to build this with performance in mind, I guess... doesn't string comparison take a lot of resources, especially when comparing large lists with large lists?
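For what it's worth, here is a small Python sketch of the "word groups" idea, assuming a word group just means two adjacent words. It stores the groups in sets, so comparing two articles is a hash-based set intersection instead of comparing every string against every other string. The function names and the similarity measure (shared groups divided by all groups, usually called Jaccard similarity) are choices made for this example only.

```python
import re


def tokens(text):
    """Lowercase the text and split it into plain words."""
    return re.findall(r"[a-z']+", text.lower())


def word_groups(text, size=2):
    """Return the set of all runs of `size` adjacent words."""
    words = tokens(text)
    return {" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)}


def jaccard(set_a, set_b):
    """Share of items the two sets have in common, between 0.0 and 1.0."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Something like `jaccard(word_groups(article_a), word_groups(article_b))` then gives a score per pair of articles. Because set lookups are hashed, the cost of one comparison grows roughly with the size of the smaller set, not with the product of both list lengths.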

 

@knowj

Looks like a very interesting article, I'll take a look at it when I wake up tomorrow :) (it's late at night here right now). And I don't really worry about the knowledge :) I've been programming PHP for a very long time now, but so far everything I've made has been "some kind of" custom CMS (job sites, e-commerce sites, ...), which isn't very hard to make. The reason I'm looking into this is that I'm looking for "some new challenge" :)

 

Thanks a lot to both of you for thinking along with me.

 

 

 

String comparison will be resource-consuming on large texts, that's for sure. The full-text search capabilities of MySQL are definitely a better approach to the same idea. You can use them to count keywords and such, though I have no idea how you would go about it. It's an unexplored subject for me :)
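As a rough illustration of why an index helps here: full-text engines are generally built around an inverted index, a map from each word to the documents that contain it. The Python sketch below is an in-memory toy version of that idea, not how MySQL does it internally. The function names, the `min_shared` threshold, and the assumption that each article has already been reduced to a set of keywords are all made up for the example.

```python
from collections import defaultdict


def build_index(articles):
    """Map each keyword to the ids of the articles that contain it.

    `articles` is a dict of article id -> set of keywords.
    """
    index = defaultdict(set)
    for article_id, keywords in articles.items():
        for keyword in keywords:
            index[keyword].add(article_id)
    return index


def related_candidates(article_id, articles, index, min_shared=2):
    """Count shared keywords per other article, touching only articles
    that share at least one keyword with the given one."""
    shared = defaultdict(int)
    for keyword in articles[article_id]:
        for other_id in index.get(keyword, ()):
            if other_id != article_id:
                shared[other_id] += 1
    return {other_id: n for other_id, n in shared.items() if n >= min_shared}
```

With this, finding related articles for a new item only visits articles that share at least one keyword with it, rather than comparing it against the whole archive.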

