
Getting the text from a web page


tibberous


I need to take a web page and extract its text. I am thinking of using regex to kill the script and style sections, then strip_tags or more regex to kill the rest of the non-text. I need the text categorized as sentences, so I somehow have to keep track of how each piece is grouped with the text around it.

 

I have two questions:

 

1) What do you guys think is the best way to do this? Regex, explodes, iteration, building a DOM structure, callbacks?

 

2) What would the regex look like if I wanted to find <, then any amount of whitespace, then the word script, then anything up to the closing tag (<, optional whitespace, /script, and then anything until >)? Put another way, how can I match the whole script or style section of a document?
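A minimal sketch of one pattern that fits the description in question 2. It is written in Python only because the thread has no code of its own; the constructs used (\s, \b, a lazy .*?, a backreference, and the case-insensitive and dot-matches-newline flags) are also available in PCRE. The names BLOCK_RE and strip_script_and_style are made up for illustration. It assumes no attribute value contains a literal >, and a script that contains the text </script> inside a string will end the match early.

```python
import re

# Matches <script ...>...</script> or <style ...>...</style>, allowing
# whitespace after "<" and around "/" as described in the question.
BLOCK_RE = re.compile(
    r"<\s*(script|style)\b[^>]*>"   # opening tag with optional attributes
    r".*?"                          # body, matched as short as possible
    r"<\s*/\s*\1\s*>",              # closing tag with the same name
    re.IGNORECASE | re.DOTALL,      # any letter case; "." also matches newlines
)


def strip_script_and_style(html):
    """Return html with script and style elements removed (hypothetical helper)."""
    return BLOCK_RE.sub("", html)
```

The lazy .*? matters here: a greedy .* would run from the first opening tag to the last closing tag on the page and swallow everything in between.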


Here is what I am thinking:

 

1) Clean up the HTML with Tidy; this should get rid of 90% of the error checking needed.

 

2) Wipe out the script and style sections.

 

3) Wipe out comments.

 

4) Remove tags that are displayed inline, as they are probably not dividing content (span, font, b, u, i, em, big, q, small, sub, sup, strong).

 

5) Remove any remaining tag that is not a container (containers: blockquote, body, center, div, h1-h6, li, p, pre, td, textarea, title).

 

6) Convert <br> into \n or ' '.

 

7) Grab all the container tags, starting with the innermost. So, if you had <body><div>Hello there</div></body>, it would grab the div containing "Hello there", then grab the body containing nothing.
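Steps 2 through 7 can be sketched with a callback-style parser instead of regex. This is only an illustration under stated assumptions: it uses Python's standard html.parser, which tolerates sloppy markup, in place of Tidy (step 1); the names TextBlocks and extract_blocks are hypothetical; text outside any container is ignored; and empty containers (like the body in the example above) are dropped instead of being reported.

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}  # step 2: contents are discarded
CONTAINERS = {"blockquote", "body", "center", "div", "h1", "h2", "h3", "h4",
              "h5", "h6", "li", "p", "pre", "td", "textarea", "title"}


class TextBlocks(HTMLParser):
    """Collects (tag, text) pairs, innermost containers first."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished (tag, text) pairs
        self.stack = []      # open containers as (tag, list of text pieces)
        self.skip_depth = 0  # greater than 0 while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self._add(" ")                 # step 6: <br> becomes a space
        elif tag in CONTAINERS:
            self.stack.append((tag, []))
        # Steps 4 and 5: every other tag is ignored, but its text is kept.

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            self._close(tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self._add(data)

    # Step 3: comments arrive in handle_comment, which does nothing by default.

    def _add(self, text):
        if self.stack:
            self.stack[-1][1].append(text)

    def _close(self, tag):
        if tag not in [name for name, _ in self.stack]:
            return                         # stray end tag: ignore it
        while self.stack:                  # step 7: innermost first
            name, pieces = self.stack.pop()
            text = " ".join("".join(pieces).split())
            if text:
                self.blocks.append((name, text))
            if name == tag:
                break

    def flush(self):
        """Emit containers that were never closed in the markup."""
        while self.stack:
            self._close(self.stack[-1][0])


def extract_blocks(html):
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks
```

For <body><div>Hello there</div></body> this yields a single pair, the div with "Hello there". Each block's text could then be split into sentences separately, which keeps track of which text was grouped together on the page.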

 

I think this should work. Right now I am having a hell of a time though:

 

Tidy isn't working the way the docs say it should. I think I have the wrong version, and I can't use the version for PHP 5 because this has to run on PHP 4. Also, even if I can get it to work on my local Apache server, it will break when I put it on my Linux server. If there are Tidy alternatives, I would be interested in hearing about them.

 

I have forgotten regex, and when I learned regex it was for Perl, which I remember being different.

 

If anyone has any comments or suggestions, please let me know. If you know how to get Tidy to work, REALLY please let me know :)


See this thread; it may help you.

http://www.phpfreaks.com/forums/index.php/topic,151200.msg652402.html#msg652402

 

That's kind of what I want, but mine has to work with any given web page, so it needs to do a ton of error handling as well as parse the entire document.

