christianbale Posted February 19, 2012

Hi all, I need help parsing HTM documents. Please find my code below:

<?
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",
        ),
        $text );
    // strip_tags removes the remaining html tags
    return strip_tags( $text);
}

function strip_random_characters( $text )
{
    // This function removes all the rest of the special characters
    $special_characters = preg_replace(
        array("/(?![.=$'€%-])\p{P}/", "[^-\w\d\s\.=$'€%]", ),
        array(" ", " ",),
        $text);
    //$special_characters = preg_replace(array("[^-\w\d\s\.=$'€%]", ), array(" ",), $text);
    $data = str_replace(
        array(
            ".", ",", "/", "^", "(", ")", "'", "-",
            "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
            "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ", "Š",
            "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¤", "¥", "€", "¿", "¡", "‡",
        ),
        array(
            " ", " ", " ", " ", " ", " ", " ", " ", " ", " ",
            " ", " ", " ", " ", " ", " ", " ", " ", " ", " ",
            " ", " ", " ", " ", " ", " ", " ", " ", " ", " ",
            " ", " ", " ", " ", " ", " ", " ", " ", " ", " ",
            " ", " ", " ", " ", " ", " ", " ",
        ),
        $special_characters );
    return $data;
}

function upper_to_lower($string)
{
    // This function converts everything to lower case...
    $doc = str_replace(
        array("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z",),
        array("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",),
        $string);
    return $doc;
}

This code removes the HTML tags completely, but I'm still getting special characters in the processed document. Here are some of the characters that remain:

" ⠲ â ³n  ⠲ â ³wï ï  n  wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% "

I want to remove these special characters. Any help will be appreciated!

Thanks
codebyren Posted February 19, 2012

Just as a side note, you could look at htmlpurifier for removing HTML tags etc. I believe it's very thorough.

Then, what's wrong with PHP's strtolower function for converting characters to lower case?

Finally, do you have a sample document? I don't see why a single preg_replace() with a whitelist of characters shouldn't work and am curious to try.
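To make the whitelist idea concrete, here is a minimal sketch rather than a tested fix for the original documents. It assumes the goal is plain English words only, so the whitelist is just the ASCII letters a-z. The function name `whitelist_letters` is hypothetical and does not appear in the thread. Entities are decoded first, which may be why fragments such as "amp" are left behind, and the text is then lower-cased with the built-in `strtolower()`.

```php
<?php
// Hypothetical helper, not from the thread: keep only a-z and single spaces.
// Assumption: the text is English, so every non-ASCII byte can be dropped.
function whitelist_letters(string $text): string
{
    // Turn entities such as &amp; back into characters so that the
    // entity name ("amp") is not left behind as a stray word.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Built-in lower-casing; this replaces the hand-written A-Z lookup table.
    $text = strtolower($text);

    // Whitelist: replace every run of characters that is NOT a-z with a
    // single space. No /u modifier is used, so the pattern works byte by
    // byte and does not fail on input that is not valid UTF-8.
    $text = (string) preg_replace('/[^a-z]+/', ' ', $text);

    return trim($text);
}

// Usage: run this on the output of strip_tags().
echo whitelist_letters('Only fjord &amp; coast: 100% “east” €5!'), "\n";
// prints "only fjord coast east"
```

If accented letters or other scripts need to survive, the whitelist would have to change (for example `\p{L}` with the `/u` modifier), and the input would first need to be valid UTF-8.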