
Need help in parsing htm documents


christianbale


Hi all,

I need help parsing HTML (.htm) documents. Please find my code below.

 

<?php

function strip_html_tags( $text )
{
    // Note: with the u modifier, preg_replace() returns NULL if $text is not valid UTF-8.
    $text = preg_replace(
        array(
          // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
          // Add a line break before each block-level tag
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern: 9 spaces, then 7 line breaks
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );
    // strip_tags removes the remaining html tags
    return strip_tags( $text );
}

 

function strip_random_characters( $text )
{
    // This function removes all the rest of the special characters.
    // Each pattern needs delimiters and the u modifier; without delimiters,
    // "[...]" is read as the delimiter pair rather than as a character class.
    $special_characters = preg_replace(
        array( '/(?![.=$\'€%-])\p{P}/u', '/[^-\w\s.=$\'€%]/u' ),
        array( ' ', ' ' ),
        $text );

    //$special_characters = preg_replace( '/[^-\w\s.=$\'€%]/u', ' ', $text );

    // str_replace() compares bytes, so this file must be saved in the same
    // encoding as $text (UTF-8 here) for the non-ASCII entries to match.
    $data = str_replace(
        array( ".", ",", "/", "^", "(", ")", "'", "-",
               "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
               "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ", "Š",
               "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¿", "¡", "‡" ),
        " ",
        $special_characters );

    return $data;
}

 

 

function upper_to_lower($string) {
    // This function converts the ASCII letters A-Z to lower case.
    $doc = str_replace(range('A', 'Z'), range('a', 'z'), $string);

    return $doc;
}

 

 

This code removes the HTML tags completely, but I'm still getting special characters in the processed document. "  â ² â ³n    â ² â ³wï    ï       n           wï lpg=pa    amp dq=%  only+fjord+on+the+east+coast%  v=onepage amp q=%  only%  fjord%  on% " are some of the characters left in the processed document. I want to remove these special characters.

 

Any help will be appreciated! Thanks

 

 

 

https://forums.phpfreaks.com/topic/257330-need-help-in-parsing-htm-documents/

Just as a side note, you could look at HTML Purifier for removing HTML tags etc. I believe it's very thorough.

 

Then, what's wrong with PHP's strtolower function for converting characters to lower case?
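For illustration, here is a minimal sketch of what that could look like. It reuses the name upper_to_lower from the post above; falling back from mb_strtolower() to strtolower() is an assumption about the setup (mb_strtolower() needs the mbstring extension and is only useful if the text is UTF-8).

```php
<?php

// Sketch: lower-case a string with PHP's built-in functions
// instead of a 26-entry str_replace() table.
function upper_to_lower(string $string): string
{
    // mb_strtolower() also lower-cases non-ASCII letters such as "Ä",
    // assuming the input is UTF-8 and mbstring is installed.
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($string, 'UTF-8');
    }

    // Fallback: strtolower() handles the ASCII letters A-Z.
    return strtolower($string);
}

echo upper_to_lower("Only FJORD on the East Coast"), "\n";
// prints: only fjord on the east coast
```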

 

Finally, do you have a sample document?  I don't see why a single preg_replace() with a whitelist of characters shouldn't work and am curious to try.
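Without a sample document this is only a guess, but a whitelist approach might look like the sketch below. The function name keep_whitelisted_characters and the exact set of allowed characters are assumptions made for illustration. It also assumes the input is UTF-8 and decodes entities first, since a leftover "amp" in the output usually points to an &amp; entity that lost its "&" and ";".

```php
<?php

// Hypothetical helper: keep letters, whitespace and a few symbols,
// and turn everything else into a single space.
function keep_whitelisted_characters(string $text): string
{
    // Decode entities such as &amp; first, so that stripping "&" and ";"
    // does not leave a stray "amp" behind.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // One pass with a whitelist: anything that is not a letter (\p{L}),
    // whitespace, or one of . = $ ' % - is replaced by a space.
    // The u modifier makes the pattern work on UTF-8 characters, not bytes.
    $clean = preg_replace('/[^\p{L}\s.=$\'%-]+/u', ' ', $text);

    // preg_replace() returns null when $text is not valid UTF-8.
    if ($clean === null) {
        return '';
    }

    // Collapse the runs of whitespace that the replacement leaves behind.
    return trim(preg_replace('/\s+/u', ' ', $clean));
}

echo keep_whitelisted_characters("only&amp;fjord ′ 123"), "\n";
// prints: only fjord
```

If the document is not UTF-8 to begin with, it would need converting first (for example with mb_convert_encoding()), otherwise the u modifier makes the match fail.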
