sug15 Posted September 16, 2009

Here's an interesting question. Let's say I want to have a bot go to a page like http://www.phpfreaks.com/tutorial/php-add-text-to-image and find only the "meaningful" part of the page. For example, the bot would parse out raw text like this:

One of the standard features of a message board is allowing members to have a signature, which is appended to the bottom of each post they make. Posters can put whatever they want into the signature (within forum settings). Putting quotes in one's signature is one of the more popular things to do ...etc..etc... On the other hand, there is a lot to explore and discuss about the specifics of captcha, so who knows, maybe I will. Until then, Happy Coding! Crayon Violent

So something like that, without finding the location of an RSS feed on the page or anything similar. Does anyone have an idea how to do it? I'm not really looking for a full solution, just ideas.

My idea right now is to look for blocks of text that contain few HTML tags, and/or to find what appears to be the "main part" of the page by somehow analyzing the CSS. Any suggestions? Sorry, I know this may be a confusing question.
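A minimal sketch of the first idea in the post above (keep blocks of text that contain few HTML tags), written in Python with only the standard library. Everything here is an assumption for illustration and not something from the thread: the names `BlockCollector` and `meaningful_text`, the list of tags treated as block boundaries, and the two thresholds `min_chars` and `max_tags_per_100_chars`. It does not attempt the CSS-analysis idea at all.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed list; extend as needed).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section",
              "blockquote", "pre", "h1", "h2", "h3"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into blocks and counts the inline tags in each one."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished (text, inline_tag_count) pairs
        self._text = []         # text chunks of the block being built
        self._tags = 0          # inline tags seen in the current block
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def flush(self):
        # Close the current block, normalising runs of whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags))
        self._text = []
        self._tags = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        else:
            self._tags += 1     # links, spans, images, and so on

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._text.append(data)


def meaningful_text(html, min_chars=80, max_tags_per_100_chars=2.0):
    """Return the blocks that are long and contain few inline tags."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, tags in parser.blocks:
        density = tags * 100.0 / len(text)
        if len(text) >= min_chars and density <= max_tags_per_100_chars:
            kept.append(text)
    return kept
```

Calling `meaningful_text(page_source)` on a downloaded page returns a list of candidate paragraphs. Navigation bars and footers tend to be short and link-heavy, so they fall below the thresholds, but the cut-off values would need tuning per site and the heuristic can easily drop short legitimate paragraphs.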
sug15 Posted September 16, 2009
Bump.
dennismonsewicz Posted September 16, 2009
Look into a PHP script called Sphider: http://www.sphider.eu/
dreamwest Posted September 16, 2009
Quote (dennismonsewicz): look into a php script called sphider, http://www.sphider.eu/
This looks interesting. I've been tinkering with a web search crawler for a little while now, and this will give me a base to work from. Thanks!
sug15 Posted September 16, 2009
Quote (dennismonsewicz): look into a php script called sphider, http://www.sphider.eu/
I'll play with this a little, but I'm not sure yet whether it helps. We'll see.
sug15 Posted September 16, 2009
Final bump.
MadTechie Posted September 16, 2009
17. Users should not "bump" topics that are still on the first page of the forums. If you bump, you must provide additional information. If you resort to bumping, chances are your question needs to be re-thought and re-described (see Eric Raymond's "How To Ask Questions The Smart Way").