Parsing out the "Meaningful" Part of a Page


Here's an interesting question. Say I want a bot to visit a page like http://www.phpfreaks.com/tutorial/php-add-text-to-image and extract only the "meaningful" part of the page. For example, the bot would parse out raw text like this:

One of the standard features of a message board is allowing members to have a signature, which is appended to the bottom of each post they make. Posters can put whatever they want into the signature (within forum settings). Putting quotes in one's signature is one of the more popular things to do

...etc..etc...

On the other hand, there is a lot to explore and discuss about the specifics of captcha, so who knows, maybe I will. Until then,

Happy Coding!

Crayon Violent

So, something like that, but without relying on finding an RSS feed for the page or anything similar. Does anyone have an idea how to do it? I'm not really looking for a full solution, just ideas. My idea right now is to look for blocks of text that contain few HTML tags, and/or to find what appears to be the "main part" of the page by somehow analyzing the CSS. Any suggestions? Sorry, I know this may be a confusing question.
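The first idea in the question (prefer blocks with a lot of text and few tags) can be roughed out as below. This is only a minimal sketch in Python using the standard library's `html.parser`, not a full solution. The names `BlockCollector` and `meaningful_blocks` are hypothetical, and the thresholds (`min_chars`, `max_link_ratio`, `max_tags_per_100_chars`) are guesses that would need tuning per site. It does not attempt the CSS analysis at all.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks"; everything else counts as inline markup.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "blockquote", "pre"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the inline tags in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (text, tag_count, link_chars) tuples
        self._text = []
        self._tags = 0
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block and start a new, empty one.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags, self._link_chars))
        self._text, self._tags, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        else:
            self._tags += 1
            if tag == "a":
                self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return  # ignore script/style contents
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def meaningful_blocks(html, min_chars=80, max_link_ratio=0.3,
                      max_tags_per_100_chars=2.0):
    """Return blocks that are long, not mostly links, and light on markup."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, tags, link_chars in parser.blocks:
        if len(text) < min_chars:
            continue  # too short: likely a menu item, button, or footer
        if link_chars / len(text) > max_link_ratio:
            continue  # mostly link text: likely navigation
        if tags * 100 / len(text) > max_tags_per_100_chars:
            continue  # too much markup per character of text
        kept.append(text)
    return kept
```

Calling `meaningful_blocks(page_html)` on a fetched page returns the surviving blocks in document order. Short paragraphs inside the article (such as a sign-off line) would be dropped by the length threshold, so a real crawler would probably also keep short blocks that sit between two kept ones.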

Look into a PHP script called Sphider: http://www.sphider.eu/

This looks interesting. I've been tinkering with a web search crawler for a little while now, and this will give me a base to work from. Thanks!

I'll play with this a little, but I'm not sure yet whether it helps. We'll see.

17. Users should not "bump" topics that are still on the first page of the forums. If you bump, you must provide additional information. If you resort to bumping, chances are your question needs to be re-thought and re-described (see Eric Raymond's "How To Ask Questions The Smart Way").