christianbale Posted February 19, 2012 Share Posted February 19, 2012 Hi all, I need help in parsing a htm documents. Please find my code below <? function strip_html_tags( $text ) { $text = preg_replace( array( // Remove invisible content '@<head[^>]*?>.*?</head>@siu', '@<style[^>]*?>.*?</style>@siu', '@<script[^>]*?.*?</script>@siu', '@<object[^>]*?.*?</object>@siu', '@<embed[^>]*?.*?</embed>@siu', '@<applet[^>]*?.*?</applet>@siu', '@<noframes[^>]*?.*?</noframes>@siu', '@<noscript[^>]*?.*?</noscript>@siu', '@<noembed[^>]*?.*?</noembed>@siu', // Add line breaks before and after blocks '@</?((address)|(blockquote)|(center)|(del))@iu', '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu', '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu', '@</?((table)|(th)|(td)|(caption))@iu', '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu', '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu', '@</?((frameset)|(frame)|(iframe))@iu', ), array( ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',"$0", "$0", "$0", "$0", "$0", "$0","$0", "$0",), $text ); // strip_tags removes the remaining html tags return strip_tags( $text); } function strip_random_characters( $text ) { //This function removes all the rest of the special characters $special_characters = preg_replace(array("/(?![.=$'€%-])\p{P}/","[^-\w\d\s\.=$'€%]", ), array(" "," ",), $text); //$special_characters = preg_replace(array("[^-\w\d\s\.=$'€%]", ), array(" ",), $text); $data = str_replace(array(".",",", "/","^","(",")","'","-","0","1","2","3","4","5","6","7","8","9","×","¢","‘","¨","™","ª","à","¤","®","¥","€","Ð","Ñ","Š","»","°","Ä","Œ","Ã","±","§","•","¤","¥","€","¿","¡","‡",), array(" "," "," "," "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " "," "," "," "," "," "," "," "," "," "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", ), $special_characters ); return $data; } function upper_to_lower($string) { //This function converts everything to lower case... $doc = str_replace(array("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z",), array("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",), $string); return $doc; } This code removes html tags completely. But still I'm getting special characters in the processed document. " ⠲ â ³n  ⠲ â ³wï ï  n  wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% " are the some of the characters in the processed document. I want to remove these special characters. Any help will be appreciated! Thanks Quote Link to comment https://forums.phpfreaks.com/topic/257330-need-help-in-parsing-htm-documents/ Share on other sites More sharing options...
codebyren Posted February 19, 2012 Share Posted February 19, 2012 Just as a side note, you could look at htmlpurifier for removing html tags etc. I believe it's very thorough. Then, what's wrong with PHP's strtolower function for converting characters to lower case? Finally, do you have a sample document? I don't see why a single preg_replace() with a whitelist of characters shouldn't work and am curious to try. Quote Link to comment https://forums.phpfreaks.com/topic/257330-need-help-in-parsing-htm-documents/#findComment-1319022 Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.