Need help in parsing htm documents

christianbale · February 19, 2012

Hi all,

I need help parsing HTML documents. Please find my code below:

<?php

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',

            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",
        ),
        $text );

    // strip_tags removes the remaining html tags
    return strip_tags( $text );
}

function strip_random_characters( $text )
{
    // This function removes all the rest of the special characters
    $special_characters = preg_replace(array("/(?![.=$'€%-])\p{P}/","[^-\w\d\s\.=$'€%]", ), array(" "," ",), $text);

    //$special_characters = preg_replace(array("[^-\w\d\s\.=$'€%]", ), array(" ",), $text);

    $data = str_replace(array(".",",", "/","^","(",")","'","-","0","1","2","3","4","5","6","7","8","9","×","¢","‘","¨","™","ª","à","¤","®","¥","€","Ð","Ñ","Š","»","°","Ä","Œ","Ã","±","§","•","¤","¥","€","¿","¡","‡",), array(" "," "," "," "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " "," "," "," "," "," "," "," "," "," "," "," ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", ), $special_characters );

    return $data;
}

function upper_to_lower($string) {
    // This function converts everything to lower case...
    $doc = str_replace(array("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z",), array("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",), $string);

    return $doc;
}

This code removes the HTML tags completely, but I am still getting special characters in the processed document. For example, "Â â ² â ³n Â â ² â ³wï ï Â n Â wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% " are some of the characters that remain. I want to remove these special characters.

Any help will be appreciated! Thanks

codebyren · February 19, 2012

Just as a side note, you could look at HTML Purifier for removing HTML tags etc. I believe it's very thorough.

Then, what's wrong with PHP's strtolower function for converting characters to lower case?
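For what it's worth, a minimal sketch of the built-in route, assuming PHP 7 or later. The wrapper name `to_lower` is hypothetical, and the `mb_strtolower()` branch only runs if the mbstring extension happens to be installed; otherwise it falls back to plain `strtolower()`, which handles the same A-Z range as the 26-pair `str_replace()` in the question.

```php
<?php
// Hypothetical wrapper: lower-case a string with PHP's built-ins
// instead of a hand-written A-Z => a-z replacement table.
function to_lower(string $text): string
{
    // mb_strtolower() also handles non-ASCII letters, but needs ext-mbstring.
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }

    // Fallback: covers the ASCII letters A-Z.
    return strtolower($text);
}

echo to_lower('Only FJORD on the East Coast'), PHP_EOL; // prints: only fjord on the east coast
```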

Finally, do you have a sample document? I don't see why a single preg_replace() with a whitelist of characters shouldn't work and am curious to try.
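In the meantime, here is a rough sketch of the single preg_replace() whitelist idea, assuming PHP 7 or later, UTF-8 input that has already been through strip_tags(), and that only ASCII letters should survive. The function name `keep_ascii_letters` is made up for illustration. Decoding entities first is a guess at where the stray "amp" fragments come from (a leftover `&amp;` with the `&` and `;` stripped). The pattern deliberately has no `u` modifier, so it works byte by byte: every byte of a multi-byte character falls outside the whitelist and is removed together with its neighbours, rather than leaving half a character behind.

```php
<?php
// Hypothetical helper: keep ASCII letters only; everything else becomes a space.
function keep_ascii_letters(string $text): string
{
    // Turn entities such as &amp; back into characters before filtering.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Whitelist: any run of bytes that is not A-Z or a-z collapses to one space.
    // preg_replace() returns null on error, so fall back to an empty string.
    $text = preg_replace('/[^A-Za-z]+/', ' ', $text) ?? '';

    // Drop leading/trailing spaces and lower-case what is left.
    return strtolower(trim($text));
}

$sample = "Only fjord &amp; the East Coast\u{2019}s 100% \u{201C}best\u{201D}";
echo keep_ascii_letters($sample), PHP_EOL; // prints: only fjord the east coast s best
```

To keep digits or a few punctuation marks, add them to the character class, for example `[^A-Za-z0-9.%]`. Keeping non-English letters would need something like `\p{L}` with the `u` modifier instead, which in turn assumes the input is valid UTF-8.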
