fife Posted February 28, 2012

I have a function that removes HTML, JavaScript and PHP from a text box that users of my site can write in. The problem is that when a user types, say:

hello.
My name is Danny
And I love php freaks

into my text box, it's also removing the newline entries. I would very much like to stop it doing that, so it looks like I have paragraphs. Can someone please help me?

Here is how I call the function.

$description = strip_word_html($_POST['description'], $allowed_tags = '<b><i><sup><sub><em><strong><u><br><br/><br />');

And the function itself, with notes.

//remove html java and php
function strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><br/><br />')
{
    mb_regex_encoding('UTF-8');

    //replace MS special characters first
    $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);

    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }

    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);

    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);

    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    $text = preg_replace($search, $replace, $text);

    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    if($num_matches){
        $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }

    return $text;
}
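One likely cause is visible in the function itself: the whitespace clean-up step uses `\s\s+`, and `\s` matches carriage returns and line feeds as well as spaces. Browsers submit textarea line breaks as `\r\n`, which is two whitespace characters, so each line break is replaced with a single space. The sketch below isolates that step. The names `collapse_whitespace` and `collapse_spaces_only` are made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// The whitespace step from strip_word_html(), copied out on its own.
function collapse_whitespace(string $text): string
{
    return preg_replace(
        array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'),
        array('', '', ' '),
        $text
    );
}

// Hypothetical alternative: collapse runs of spaces and tabs only,
// so "\r\n" and "\n" survive untouched.
function collapse_spaces_only(string $text): string
{
    return preg_replace('/[ \t]{2,}/', ' ', $text);
}

// A textarea submission arrives with CRLF line breaks.
$submitted = "hello.\r\nMy name is Danny\r\nAnd I love php freaks";

$flattened = collapse_whitespace($submitted);  // line breaks become spaces
$preserved = collapse_spaces_only($submitted); // line breaks are kept
```

Restricting the pattern to spaces and tabs is only one option; removing the whitespace step altogether would also keep the line breaks.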
codebyren Posted February 28, 2012

Are you wrapping the $description in <pre> tags when you render it? If not, I don't believe the newlines will be honoured (from a presentation perspective) even if they do happen to make it through your filter function.

Also, I haven't looked through your function, but you may want to look at implementing HTML Purifier to handle this for you. When I looked into doing this sort of thing a few years back, it turned into a massive task, and you never quite filter out everything you think you are. May as well try a tool developed for the task.
ManiacDan Posted February 28, 2012

In HTML, any run of whitespace, newlines included, is condensed to a single space when the page is rendered. You can either run nl2br() on the text or wrap the output in <pre> tags, whichever you think works better.
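A minimal sketch of the nl2br() route, assuming the text is escaped with htmlspecialchars() first so that only the inserted `<br />` tags are treated as markup. The function name `render_description` is hypothetical.

```php
<?php
// Turn stored plain text into HTML that keeps the user's line breaks.
function render_description(string $raw): string
{
    // Escape first, so user-typed "<" and "&" cannot become markup.
    $escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // nl2br() inserts "<br />" before each newline; the newline itself stays.
    return nl2br($escaped);
}

$description = "hello.\nMy name is Danny\nAnd I love php freaks";
echo render_description($description);
// hello.<br />
// My name is Danny<br />
// And I love php freaks
```

The `<pre>` alternative would wrap the escaped text in `<pre>` and `</pre>` instead of calling nl2br(); it also keeps runs of spaces, but it renders in a monospace font unless styled otherwise.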
Psycho Posted February 28, 2012

Are you sure that mammoth function is really worthwhile? I would think just using strip_tags() and htmlspecialchars() would be enough. It would leave any JS code that was between the opening and closing tags, and perhaps some other content you are currently stripping, but that content would be rendered safe. Besides, someone trying to inject JavaScript in a submission is probably not submitting a legitimate post anyway, so if that code is displayed, that is their own fault.

It just seems like a lot of work for not much value, especially since the more complex the code, the more likely there is a bug you are not aware of.
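A minimal sketch of that two-call approach, assuming all tags are stripped (no allowed-tags list) and the result is escaped for output. The function name `simple_clean` is made up for illustration.

```php
<?php
// Strip every tag, then escape whatever is left for safe display.
function simple_clean(string $input): string
{
    // strip_tags() removes the tags but keeps the text between them.
    $withoutTags = strip_tags($input);

    // htmlspecialchars() turns &, <, >, and quotes into entities.
    return htmlspecialchars($withoutTags, ENT_QUOTES, 'UTF-8');
}

echo simple_clean('<b>bold</b> & more'); // bold &amp; more

// Line breaks pass through untouched, so nl2br() or <pre> can still
// be applied when the text is displayed.
$clean = simple_clean("first line\r\nsecond line");
```

Note that if an allowed-tags list were passed to strip_tags(), the htmlspecialchars() call would escape those surviving tags too, so they would be shown as literal text rather than applied as formatting.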
fife Posted February 29, 2012

Cool, thanks guys. I honestly thought you had to go overboard with this sort of thing in a user's area. I managed to fix it by removing the function and doing what Psycho said with strip_tags() and htmlspecialchars().
Psycho Posted February 29, 2012

Well, it is entirely up to you as to what is acceptable and what is not. Since you posted this question, we attempted to help you with what you are trying to do. But I'll give you my opinion on this subject.

When accepting user input, I think it is generally a bad idea to ever modify the input without the user's knowledge. There are some minor exceptions, such as when the user enters a date: as long as I can interpret the value of the date, I'm not concerned with the format. But you should make a conscious decision as to what is not allowed for a particular input. If the input contains anything that is not allowed, I believe you should reject the input instead of modifying it.

In your current process, someone might enter some text that has meaning to what they entered but which would be stripped out. I have sometimes put faux <sarcasm> tags around text in a post to give context to the text. If those were removed, the text could be interpreted incorrectly. So, my advice is that if you do not want tags in the input, simply reject the post and make the user resubmit it. But there is no reason you can't accept the post, because if you run the text through htmlspecialchars() it will be safe to display in the page.

Lastly, I would advise you to store the input in its original format and apply any modifications when outputting the content. Otherwise, it can become difficult to add any functions for modifying the content or to output it for different uses.
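A minimal sketch of both options described above, under some assumptions: "contains tags" is taken to mean that strip_tags() would change the text, the stored value is the raw input, and escaping happens only at output time. The function names `validate_description` and `display_description` are hypothetical.

```php
<?php
// Option 1: reject input that contains tags instead of silently editing it.
// Returns an error message, or null when the input is acceptable.
function validate_description(string $input): ?string
{
    if (strip_tags($input) !== $input) {
        return 'Tags are not allowed in the description. Please remove them and resubmit.';
    }
    return null;
}

// Option 2: accept anything, store it unmodified, and escape on output.
function display_description(string $stored): string
{
    return nl2br(htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'));
}

$input = 'I <sarcasm>love</sarcasm> Mondays';

$error = validate_description($input); // non-null: the post would be rejected

// With option 2 the faux tags are kept and shown literally, so the
// context they were meant to give is not lost.
echo display_description($input);
// I &lt;sarcasm&gt;love&lt;/sarcasm&gt; Mondays
```

Because the raw text is what gets stored, a different output path (plain-text email, for example) can apply its own formatting later without having to undo HTML escaping.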