Getting the text from a web page

tibberous · August 6, 2007

I need to take a web page and parse the text out of it. I am thinking of using regex to kill the script and style sections, then strip_tags or more regex to kill the rest of the non-text. I need to have the text categorized as sentences, so I somehow have to keep track of how it is grouped with other text.

I have two questions:

1) What do you guys think is the best way to do this? Regex, explodes, iteration, building a DOM structure, callbacks?

2) What would the regex look like if I wanted to find <, then any number of whitespace characters, then the word script, then anything that is not the closing tag (<, possible whitespace, /script, and then anything until >)? Put another way, how can I match the script section of a document (and likewise the style section)?
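
For reference, a minimal sketch of the kind of pattern question 2 describes. It is written in Python only so it can be run as-is; the pattern uses Perl-style syntax, so it should carry over to PHP's preg_replace with the i and s modifiers. The names BLOCK_RE and strip_blocks are made up for this example. It is not a robust solution: a script whose own text contains a literal closing script tag would be cut short.

```python
import re

# Matches <script ...>...</script> and <style ...>...</style>.
# Whitespace is tolerated after "<" and around "/", and case is ignored.
BLOCK_RE = re.compile(
    r"<\s*(script|style)\b[^>]*>"  # opening tag with optional attributes
    r".*?"                         # body, matched as lazily as possible
    r"<\s*/\s*\1\s*>",             # closing tag for the same element name
    re.IGNORECASE | re.DOTALL,     # DOTALL lets "." span line breaks
)


def strip_blocks(html):
    """Return html with every script and style section removed."""
    return BLOCK_RE.sub("", html)


if __name__ == "__main__":
    page = '<p>a</p><script type="x">var s = "<b>";</script><p>b</p>'
    print(strip_blocks(page))
```

The lazy `.*?` is what stops the match at the first closing tag instead of swallowing everything up to the last one on the page.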

php new bie · August 6, 2007

See this thread, it may help you.

http://www.phpfreaks.com/forums/index.php/topic,151200.msg652402.html#msg652402

tibberous · August 6, 2007

Here is what I am thinking:

1) Clean up the HTML with Tidy; this should get rid of 90% of the error checking needed.

2) Wipe out the script and style sections.

3) Wipe out comments.

4) Strip tags that are displayed inline (keeping their text), as they are probably not dividing content (span, font, b, u, i, em, big, q, small, sub, sup, strong)

5) Remove any tag that is not a container (containers: blockquote, body, center, div, h1-h6, li, p, pre, td, textarea, title)

6) Convert <br> into \n or ' '.

7) Grab all the container tags, starting with the innermost. So, if you had <body><div>Hello there</div></body>, it would grab the div containing Hello there, then grab body containing nothing.
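
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for arbitrary pages: it is in Python rather than PHP, it uses the standard library's html.parser (which tolerates a fair amount of broken markup) in place of the Tidy step, and the names BlockTextExtractor and extract_blocks are invented for the example. Unlike step 7, it skips containers that end up with no text of their own.

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}
CONTAINERS = {"blockquote", "body", "center", "div", "h1", "h2", "h3", "h4",
              "h5", "h6", "li", "p", "pre", "td", "textarea", "title"}


class BlockTextExtractor(HTMLParser):
    """Collects text grouped by its nearest enclosing container tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # (tag, text) pairs, inner containers first
        self._stack = []       # open containers as (tag, text_pieces)
        self._skip_depth = 0   # > 0 while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self.handle_data(" ")          # step 6: <br> becomes a space
        elif tag in CONTAINERS:
            self._stack.append((tag, []))
        # Any other tag (inline or unknown) is dropped; its text is kept.

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in CONTAINERS and any(t == tag for t, _ in self._stack):
            self._pop_until(tag)

    def handle_data(self, data):
        # Comments never reach here; script and style bodies are ignored.
        if self._skip_depth == 0 and self._stack:
            self._stack[-1][1].append(data)

    def _pop_until(self, tag):
        # Close containers from the innermost outward until tag is closed
        # (tag=None closes everything that is still open).
        while self._stack:
            open_tag, pieces = self._stack.pop()
            text = " ".join("".join(pieces).split())  # collapse whitespace
            if text:
                self.blocks.append((open_tag, text))
            if open_tag == tag:
                break


def extract_blocks(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser._pop_until(None)   # flush containers that were never closed
    return parser.blocks


if __name__ == "__main__":
    print(extract_blocks("<body><div>Hello there</div></body>"))
```

Each block's text can then be split into sentences separately, so sentences never run across container boundaries.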

I think this should work. Right now I am having a hell of a time though:

Tidy isn't working the way the docs say it should. I think I have the wrong version, and I can't use the version for PHP 5 because this has to run on PHP 4. Also, even if I can get it to work on my local Apache server, it will break when I put it on my Linux server. If there are Tidy alternatives, I would be interested in hearing about them.

I have forgotten regex, and when I learned regex it was for Perl, which I remember being different.

If anyone has any comments or suggestions, please let me know. If you know how to get Tidy to work, REALLY please let me know.

tibberous · August 6, 2007

Quote from php new bie: "See this thread, it may help you. http://www.phpfreaks.com/forums/index.php/topic,151200.msg652402.html#msg652402"

That's kind of what I want, but mine has to work with any given web page, so it needs to do a ton of error handling as well as parse the entire document.

Crew-Portal · August 6, 2007

Open the document in Dreamweaver. Go to Design view and copy all the text. Then paste this text into Notepad, which will remove all styling. Save it, close it, reopen it, and copy and paste it into Microsoft Word if you want to edit what the text looks like!
