
Parsing only the content area of a page


cooldude832


I am trying to return only the content section of a page that is a "text" page, such as an about.com or Wikipedia article (so a forum would not apply). On Wikipedia I know I can easily do it by getting the div labeled "content", but what about pages that aren't labeled? Are there any ideas out there?
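For the labeled case mentioned above, here is a minimal sketch of pulling one element out by its id with PHP's DOM extension. The helper name extract_by_id and the sample markup are made up for illustration, and the id "content" is simply the one named in the post; real pages may use a different id or none at all.

```php
<?php
// Hypothetical helper: return the trimmed text of the element whose id
// attribute equals $id, or null when no such element exists.
// Assumes $id contains no double quotes (it is placed in an XPath string).
function extract_by_id(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Sample markup, invented for this example.
$sample = '<html><body><div id="nav">Home</div>'
        . '<div id="content"><p>Hello world</p></div></body></html>';
$text = extract_by_id($sample, 'content');
```

In practice $html would come from file_get_contents on the page URL; pages without a usable id need a heuristic instead.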


My new idea is this, and I think it might work:

Use strip_tags on the result of file_get_contents, but somehow preserve all the divs, tables, trs, and tds (the container elements).

Then count the number of words in each element, using its opening and closing tags as the boundaries.

The container with the greatest number of words is the "content", and then I simply find that text and I've got it. Make sense? My current issue is how in the hell do I strip all tags except <div>, <table>, <tr>, and <td> (and their closing tags)?
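On the stripping question: strip_tags takes an optional second argument listing the tags to keep, so the container elements can be preserved directly. The sketch below shows that, then one possible take on the word-count idea. Two things in it are assumptions rather than anything stated in the post: the counting is done with DOMDocument instead of matching tags by hand, and each container is credited only with words that are not inside a nested container (otherwise the outermost div would always win). The function names are hypothetical.

```php
<?php
// Part 1: keep only the container tags; all other tags are stripped,
// but their inner text stays.
function keep_containers(string $html): string
{
    return strip_tags($html, '<div><table><tr><td>');
}

// Part 2: count the words a container "owns", meaning text that is not
// inside a nested container element. Script and style bodies are skipped.
function own_word_count(DOMNode $node, array $containers): int
{
    $count = 0;
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $count += str_word_count($child->nodeValue);
        } elseif ($child instanceof DOMElement) {
            $name = strtolower($child->nodeName);
            if (in_array($name, ['script', 'style'], true)) {
                continue;
            }
            if (!in_array($name, $containers, true)) {
                // Inline or block markup (p, b, a, ...) counts toward this container.
                $count += own_word_count($child, $containers);
            }
        }
    }
    return $count;
}

// Return the text of the container owning the most words, or null if none.
function guess_content(string $html): ?string
{
    $containers = ['div', 'table', 'tr', 'td'];
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $best = null;
    $bestCount = 0;
    foreach ($containers as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $element) {
            $count = own_word_count($element, $containers);
            if ($count > $bestCount) {
                $best = $element;
                $bestCount = $count;
            }
        }
    }
    return $best === null ? null : trim($best->textContent);
}

// Sample markup, invented for this example.
$page = '<html><body><div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>'
      . '<div><p>This is the long article body with many more words than the menu.</p></div>'
      . '</body></html>';
$stripped = keep_containers('<div><p>Hi <b>there</b></p></div>');
$content = guess_content($page);
```

This heuristic can still be fooled by long sidebars or comment sections, so treat the result as a guess to verify, not a guaranteed match.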

 
