
remove html but keep formatting


fife


I have a function that removes HTML, JavaScript and PHP from a text box that users of my site can write in. The problem is when a user types, say:

 

hello.  My name is Danny

 

 

And I love php freaks

 

into my text box, it also removes the newlines. I would very much like to stop it doing that so it looks like I have paragraphs. Can someone please help me? Here is how I call the function.

 

 $description = strip_word_html($_POST['description'], '<b><i><sup><sub><em><strong><u><br><br/><br />'); 

 

And the function itself with notes.

 

//remove HTML, JavaScript and PHP
function strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><br/><br />') 
    { 
        mb_regex_encoding('UTF-8'); 
        //replace MS special characters first 
        $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u'); 
        $replace = array('\'', '\'', '"', '"', '-'); 
        $text = preg_replace($search, $replace, $text); 
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears 
        //in some MS headers, some html entities are encoded and some aren't 
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); 
        //try to strip out any C style comments first, since these, embedded in html comments, seem to 
        //prevent strip_tags from removing html comments (MS Word introduced combination) 
        if(mb_stripos($text, '/*') !== FALSE){ 
            //preg_replace is used here because mb_eregi_replace patterns take no # delimiters 
            $text = preg_replace('#/\*.*?\*/#su', '', $text); 
        } 
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be 
        //'<1' becomes '< 1'(note: somewhat application specific) 
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text); 
        $text = strip_tags($text, $allowed_tags); 
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one 
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text); 
        //strip out inline css and simplify style tags 
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu'); 
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>'); 
        $text = preg_replace($search, $replace, $text); 
        
        //some MS Style Definitions - this last bit gets rid of any leftover comments */ 
        $num_matches = preg_match_all("/\<!--/u", $text, $matches); 
        if($num_matches){ 
              $text = preg_replace('/\<!--(.)*--\>/isu', '', $text); 
        } 
        return $text; 
    } 



Are you wrapping the $description in <pre> tags when you render it?  If not, I don't believe the newlines will be honoured (from a presentation perspective) even if they do happen to make it through your filter function.
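For what it's worth, a likely reason the newlines vanish in the first place is the `/\s\s+/` replacement in the posted function: `\s` matches line breaks too, so a blank line between paragraphs gets squashed into a single space. Below is a minimal sketch of one way around it, assuming you only want to collapse spaces and tabs. `collapse_horizontal_whitespace()` is a made-up name for illustration, not part of the original function.

```php
<?php
// Hypothetical helper (not in the original post): collapse runs of spaces
// and tabs, but leave \r and \n untouched so paragraphs survive.
function collapse_horizontal_whitespace(string $text): string
{
    // \h matches horizontal whitespace only, unlike \s which also matches newlines.
    // Fall back to the input if the regex fails (e.g. invalid UTF-8).
    return preg_replace('/\h{2,}/u', ' ', $text) ?? $text;
}

$raw   = "hello.  My name is Danny\r\n\r\nAnd I love php freaks";
$clean = collapse_horizontal_whitespace($raw);

// Escape first, then turn line breaks into <br /> tags for display.
echo nl2br(htmlspecialchars($clean, ENT_QUOTES, 'UTF-8'));
```

If you would rather not insert `<br />` tags, setting `white-space: pre-wrap` in CSS on the element that holds the text is another option instead of `<pre>`.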

 

Also, I haven't looked through your function, but you may want to look at using HTML Purifier to handle this for you. When I looked into doing this sort of thing a few years back it turned into a massive task, and you never quite filter out everything you think you are. You may as well try a tool developed for the task.


Are you sure that mammoth function is really worthwhile? I would think just using strip_tags() and htmlspecialchars() would be enough. It would leave any JS code that was between the opening/closing tags, and perhaps some other content you are currently stripping, but that content would be rendered safe. Besides, someone trying to inject JavaScript in a submission is probably not submitting a legitimate post anyway, so if that code is displayed, that is their fault. It just seems like a lot of work for not much value, especially since the more complex the code, the more likely there is a bug you are not aware of.
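A rough sketch of that simpler approach, assuming no tags at all are allowed through. The function name `plain_text_only()` is invented for this example.

```php
<?php
// Hypothetical helper: drop every tag, then escape whatever is left.
function plain_text_only(string $input): string
{
    // strip_tags() removes the tags themselves but keeps the text between them.
    $text = strip_tags($input);

    // htmlspecialchars() escapes &, <, > and quotes so the result displays as text.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo plain_text_only('<b>Tom & Jerry</b>'); // Tom &amp; Jerry
```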


Well, it is entirely up to you as to what is acceptable and what is not. Since you posted this question we attempted to help you with what you are trying to do. But, I'll give you my opinion on this subject.

 

When accepting user input I think it is generally a bad idea to ever modify the user input without their knowledge. There are some minor exceptions, such as when the user enters a date: as long as I can interpret the value of the date I'm not concerned with the format. But you should make a conscious decision as to what is not allowed for a particular input. If the input contains anything that is not allowed, I believe you should reject the input instead of modifying it.

 

In your current process someone might enter text that carries meaning but which would be stripped out. I have sometimes put faux <sarcasm> tags around text in a post to give context to the text. If those were removed, the text could be interpreted incorrectly.

 

So, my advice is that if you do not want tags in the input, simply reject the post and make the user correct it. But there is no reason you can't accept the post, because if you run the text through htmlspecialchars() it will be safe to display in the page.
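To illustrate the "reject rather than silently modify" idea, here is a small sketch. It assumes that "contains tags" can be approximated by checking whether strip_tags() would change the input; `contains_markup()` is a hypothetical name.

```php
<?php
// Hypothetical check: true if strip_tags() would remove anything from the input.
function contains_markup(string $input): bool
{
    return strip_tags($input) !== $input;
}

$post = 'I <sarcasm>love</sarcasm> Mondays';

if (contains_markup($post)) {
    // Option 1: reject and tell the user why, instead of altering their text.
    $error = 'Tags are not allowed in this field. Please remove them and resubmit.';
}

// Option 2: accept the text as typed and escape it only when displaying it.
$display = htmlspecialchars($post, ENT_QUOTES, 'UTF-8');
```

With option 2 the faux `<sarcasm>` tags show up on the page as literal text, so the poster's meaning is preserved.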

 

Lastly, I would advise that you store the input in its original format and apply any modifications when outputting the content. Otherwise, it can become difficult to add new functions for modifying the content or to output it for different uses.
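As a sketch of what "store raw, transform on output" could look like, the same stored string is rendered two ways below. Both function names and the plain-text email use case are assumptions made up for illustration.

```php
<?php
// The text exactly as the user typed it (as it would be stored).
$stored = "I <sarcasm>love</sarcasm> Mondays\nReally.";

// Hypothetical renderer for a web page: escape, then convert line breaks.
function for_html_page(string $raw): string
{
    return nl2br(htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'));
}

// Hypothetical renderer for a plain-text email: no escaping needed, just wrap long lines.
function for_text_email(string $raw): string
{
    return wordwrap($raw, 72, "\n");
}

echo for_html_page($stored);
echo for_text_email($stored);
```

Because nothing was stripped before storage, a new output format only needs a new renderer.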

