tibberous Posted August 6, 2007
Link to comment: https://forums.phpfreaks.com/topic/63507-getting-the-text-from-a-web-page/

I need to take a web page and parse the text from the site. I am thinking of using regex to kill the script and style sections, then strip tags or more regex to kill the rest of the non-text. I need to have the text categorized as sentences, so I somehow have to keep track of how it is grouped with other text.

I have two questions:

1) What do you guys think is the best way to do this? Regex, explodes, iteration, building a DOM structure, callbacks?

2) What would the regex look like if I wanted to find <, then any number of whitespace characters, then the word script, then anything except for (<, possible whitespace, /script, and then anything until >)? Put another way, how can I match a whole script (or style) section of a document?
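For question 2, here is a minimal sketch of one way to write that pattern. It is in Python rather than PHP, and the names `SCRIPT_STYLE_RE` and `strip_script_and_style` are made up for illustration. The pattern only uses constructs that PCRE also supports, but it has not been tested in PHP, so treat it as a starting point. It assumes no attribute value in the opening tag contains a literal `>`.

```python
import re

# "<", optional whitespace, "script" or "style", optional attributes, ">",
# then the shortest possible body, then "<", optional whitespace, "/",
# the same tag name again, and anything up to the closing ">".
SCRIPT_STYLE_RE = re.compile(
    r"<\s*(script|style)\b[^>]*>"  # opening tag
    r".*?"                         # body (non-greedy, may span lines)
    r"<\s*/\s*\1\b[^>]*>",         # matching closing tag
    re.IGNORECASE | re.DOTALL,
)


def strip_script_and_style(html):
    """Return html with script and style elements removed, contents included."""
    return SCRIPT_STYLE_RE.sub("", html)
```

The non-greedy `.*?` is what stands in for "anything except the closing tag": it stops at the first closing tag it finds, so two separate script blocks are not swallowed as one.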
php new bie Posted August 6, 2007

See this thread; it may help you: http://www.phpfreaks.com/forums/index.php/topic,151200.msg652402.html#msg652402
tibberous (Author) Posted August 6, 2007

Here is what I am thinking:

1) Validate the HTML with Tidy; this should get rid of 90% of the error checking needed.
2) Wipe out the script and style sections.
3) Wipe out comments.
4) Remove tags that are displayed inline, as they are probably not dividing content (span, font, b, u, i, em, big, q, small, sub, sup, strong).
5) Remove any tag that is not a container (containers: blockquote, body, center, div, h1-h6, li, p, pre, td, textarea, title).
6) Convert <br> into \n or ' '.
7) Grab all the container tags, starting with the innermost. So, if you had <body><div>Hello there</div></body>, it would grab the div containing Hello there, then grab body containing nothing.

I think this should work. Right now I am having a hell of a time though: Tidy isn't working the way the docs say it should. I think I have the wrong version, and I can't use the version for PHP 5 because this has to run on PHP 4. Also, even if I can get it to work on my local Apache server, it will break when I put it on my Linux server.

If there are Tidy alternatives, I would be interested in hearing about them. I have forgotten regex, and when I learned regex it was for Perl, which I remember being different. If anyone has any comments or suggestions, please let me know. If you know how to get Tidy to work, REALLY please let me know.
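Steps 2 through 7 above can be sketched with a small event-driven parser instead of a chain of regexes. This is only a rough illustration in Python (not PHP, and not the poster's code), built on the standard library's `html.parser`. The names `BlockTextExtractor` and `extract_blocks` are hypothetical. It assumes the input is already reasonably well formed (the job step 1 gives to Tidy), it drops text that sits outside every container, and it collapses whitespace, which also flattens `pre` blocks. Unlike step 7 as written, it skips containers that end up with no text of their own.

```python
from html.parser import HTMLParser

# Container tags from step 5; every other tag is treated as removable.
CONTAINERS = {"blockquote", "body", "center", "div", "h1", "h2", "h3", "h4",
              "h5", "h6", "li", "p", "pre", "td", "textarea", "title"}
SKIPPED = {"script", "style"}  # step 2: contents are thrown away


class BlockTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished (tag, text) pairs, innermost first
        self._stack = []      # open containers as [tag, list_of_text_pieces]
        self._skip_depth = 0  # greater than 0 while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag == "br":
            self.handle_data(" ")          # step 6: <br> becomes a space
        elif tag in CONTAINERS:
            self._stack.append([tag, []])
        # Steps 4 and 5: any other tag is ignored, but its text still
        # reaches the enclosing container through handle_data.

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif any(open_tag == tag for open_tag, _ in self._stack):
            # Also closes any inner containers that were left unclosed.
            while self._stack:
                if self._close_innermost() == tag:
                    break

    def handle_data(self, data):
        # Step 3 needs no code: comments go to handle_comment, which the
        # base class ignores by default.
        if self._skip_depth == 0 and self._stack:
            self._stack[-1][1].append(data)

    def _close_innermost(self):
        tag, pieces = self._stack.pop()
        text = " ".join("".join(pieces).split())  # collapse whitespace
        if text:
            self.blocks.append((tag, text))
        return tag

    def close(self):
        super().close()
        while self._stack:                 # flush containers never closed
            self._close_innermost()


def extract_blocks(html):
    """Return (tag, text) pairs for each container, innermost first."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

With the example from step 7, `extract_blocks("<body><div>Hello there</div></body>")` returns `[("div", "Hello there")]`. The div is emitted when its closing tag arrives, before body closes, which gives the innermost-first order without a separate pass.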
tibberous (Author) Posted August 6, 2007

Quote: "see this thread ,, it may help you . http://www.phpfreaks.com/forums/index.php/topic,151200.msg652402.html#msg652402"

That's kind of what I want, just mine has to work with any given web page and therefore needs to do a ton of error handling, as well as parsing the entire document.
Crew-Portal Posted August 6, 2007

Open the document in Dreamweaver. Go to Design view and copy all the text. Then paste this text into Notepad, which will remove all styling. Save it, close it, reopen it, and then copy and paste it into Microsoft Word if you want to edit what the text looks like!