Parsing out the "Meaningful" Part of a Page


sug15


Here's an interesting question. Let's say I want a bot to visit a page like http://www.phpfreaks.com/tutorial/php-add-text-to-image and extract only the "meaningful" part of the page. For example, the program would parse out raw text like this:

One of the standard features of a message board is allowing members to have a signature, which is appended to the bottom of each post they make. Posters can put whatever they want into the signature (within forum settings). Putting quotes in one's signature is one of the more popular things to do

...etc..etc...

On the other hand, there is a lot to explore and discuss about the specifics of captcha, so who knows, maybe I will. Until then,

Happy Coding!

Crayon Violent

So something like that, without relying on finding an RSS feed on the page or anything similar. Does anyone have an idea how to do it? I'm not really looking for a full solution, just ideas. My idea right now is to look for blocks of text that contain few HTML tags, and/or to find what appears to be the "main part" of the page by somehow analyzing the CSS. Any suggestions? Sorry, I know this may be a confusing question.
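To make the "blocks of text with few HTML tags" idea concrete, here is a rough sketch in Python using only the standard library. It is not a tested solution, just one way the heuristic could look. The container tag list, the `TAG_PENALTY` weight, the scoring formula, and the names `BlockScorer`, `main_text`, and `SAMPLE` are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start a new candidate block.
CONTAINERS = {"body", "div", "td", "article", "section", "main"}
# Text inside these tags is never treated as page content.
IGNORED = {"script", "style"}
# Assumed cost, in characters, of each tag found inside a block.
TAG_PENALTY = 20


class BlockScorer(HTMLParser):
    """Collects the text and the tag count of each container block."""

    def __init__(self):
        super().__init__()
        # The first entry catches anything outside a known container.
        self.stack = [{"text": [], "tags": 0}]
        self.blocks = []
        self.ignoring = 0

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignoring += 1
        elif tag in CONTAINERS:
            self.stack.append({"text": [], "tags": 0})
        else:
            # Inline and list tags count against the enclosing block.
            self.stack[-1]["tags"] += 1

    def handle_endtag(self, tag):
        if tag in IGNORED:
            self.ignoring = max(0, self.ignoring - 1)
        elif tag in CONTAINERS and len(self.stack) > 1:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        if not self.ignoring:
            self.stack[-1]["text"].append(data)


def main_text(html):
    """Return the text of the block with the best length-minus-tags score."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    best_text, best_score = "", 0
    # Blocks still on the stack were never closed; score them as well.
    for block in parser.blocks + parser.stack:
        # Joining chunks with a space can split a word around an inline tag.
        text = " ".join(" ".join(block["text"]).split())
        score = len(text) - TAG_PENALTY * block["tags"]
        if score > best_score:
            best_text, best_score = text, score
    return best_text


SAMPLE = """<html><body>
<div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/forums">Forums</a></li></ul></div>
<div id="content"><p>One of the standard features of a message board is allowing
members to have a signature.</p><p>Happy Coding!</p></div>
<div id="footer"><a href="/about">About</a></div>
<script>var note = "javascript that should never be picked";</script>
</body></html>"""

if __name__ == "__main__":
    print(main_text(SAMPLE))
```

On the small `SAMPLE` page the navigation and footer blocks score below zero because they are mostly tags, so the content block wins. Real pages are messier: the penalty would need tuning, and text directly inside nested containers is credited only to the innermost one, so an article split across many small `div`s would not be picked up as a whole.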

Look into a PHP script called Sphider: http://www.sphider.eu/

This looks interesting. I've been tinkering with a web search crawler for a little while now, and this will give me a base to work from. Thanks!

I'll play with this a little, but I'm not sure yet whether it helps. I'll see, though.

17. Users should not "bump" topics that are still on the first page of the forums. If you bump, you must provide additional information. If you resort to bumping, chances are your question needs to be re-thought and re-described (see Eric Raymond's "How To Ask Questions The Smart Way").

