Converting to html entities.

Bottyz · September 28, 2011

Hi all,

Just a quick query really. We have a contact script on our website with a message box for the enquiry. Once a user submits their message, it is checked for malicious code, then converted to HTML for use in a PEAR Mail HTML email. Everything works great except for two symbols which never get converted correctly: they always turn into ?? rather than the symbols themselves.

The symbols in question are marked in the code below.

	function cndstrips($str) {
	    if (get_magic_quotes_gpc()) {
	        return htmlentities(utf8_decode(html_entity_decode(nl2br(stripslashes($str)))));
	    } else {
	        return htmlentities(utf8_decode(html_entity_decode(nl2br($str))));
	    }
	}

	$message_body=trim(previous_request_value('message_body'));
	$message_body=str_replace('-', '-', $message_body);
	$message_body=str_replace('‘', '&#39;', $message_body); // This Symbol
	$message_body=str_replace('”', '&#34;', $message_body); // This Symbol too!
	$message_body=cndstrips($message_body);
	$message_body=str_replace('<br />', '<br>', $message_body);
	$message_body=str_replace('<br>', '<br>', $message_body);
	$message_body=str_replace('<br />', '<br>', $message_body);

I have attempted to convert them to their relevant html entity numbers (by str_replace) but this doesn't work. Is there an easier way to do the above? I'm not the best with sanitising code as you may see!!

Thanks in advance.
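
On the ?? symptom described above, one likely cause (assuming the form is submitted as UTF-8) is the utf8_decode() call in cndstrips(). It converts the text to ISO-8859-1, which has no curly quotes, so each one is replaced with a question mark. The sketch below reproduces that with mb_convert_encoding(), which targets the same charset; the sample string is made up for illustration.

```php
<?php
// Sample text with the two curly quotes from the question (hypothetical input).
$utf8 = "‘quoted”";

// Convert UTF-8 to ISO-8859-1, the same target charset utf8_decode() uses.
// Neither quote exists in ISO-8859-1, so with default mbstring settings
// each one is substituted with "?".
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo $latin1, "\n";
```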

freelance84 · September 28, 2011

I had a similar issue with htmlentities the other day, and xyph gave me some good pointers here.

The doctype must be correct, as must the content-type, and htmlentities also has an optional third parameter (after the quote-style flags), which I never knew about, that sets the character set it works in.
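
A minimal sketch of the charset parameter mentioned above, assuming the submitted text is UTF-8. Before PHP 5.4 the default charset for htmlentities() was ISO-8859-1, so passing it explicitly matters on older installs. The sample input is hypothetical.

```php
<?php
// Hypothetical sample input; assumes the form data arrives as UTF-8.
$input = "‘Quoted” text & <b>markup</b>";

// The third argument names the input encoding.
// ENT_QUOTES converts both double and single quotes.
$encoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n";
```

With the encoding stated, the curly quotes come out as named entities rather than being mangled.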

Bottyz · September 28, 2011

Hi Freelance,

Thanks for the pointer... I think it was something to do with the UTF-8 encoding. I searched for a while on the php.net site and found a very good function for converting all special characters to their &#number; entities. It works great for the script. I've included it below in case anyone else would like to use it:


//$message_body variable is the content from the textarea on the contact form. It can contain any character a user can input.

	class unicode_replace_entities {
	    // Replaces every non-ASCII character in a UTF-8 string with its &#number; entity.
	    public function UTF8entities($content="") {
	        $contents = $this->unicode_string_to_array($content);
	        $swap = "";
	        $iCount = count($contents);
	        for ($o=0;$o<$iCount;$o++) {
	            $contents[$o] = $this->unicode_entity_replace($contents[$o]);
	            $swap .= $contents[$o];
	        }
	        return mb_convert_encoding($swap,"UTF-8"); //not really necessary, but why not.
	    }

	    // Splits a UTF-8 string into an array of single characters.
	    public function unicode_string_to_array( $string ) { //adjwilli
	        $array = array(); // initialised so an empty string returns an empty array
	        $strlen = mb_strlen( $string, "UTF-8" );
	        while ($strlen) {
	            $array[] = mb_substr( $string, 0, 1, "UTF-8" );
	            $string = mb_substr( $string, 1, $strlen, "UTF-8" );
	            $strlen = mb_strlen( $string, "UTF-8" );
	        }
	        return $array;
	    }

	    // Decodes one UTF-8 character to its code point and returns it as a numeric entity.
	    public function unicode_entity_replace($c) { //m. perez
	        $h = ord($c[0]); // $c[0] rather than $c{0}: curly-brace offsets were removed in PHP 8
	        if ($h <= 0x7F) {
	            return $c; // plain ASCII, leave as-is
	        } else if ($h < 0xC2) {
	            return $c; // not a valid UTF-8 lead byte, leave as-is
	        }

	        if ($h <= 0xDF) { // 2-byte sequence
	            $h = ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
	            return "&#" . $h . ";";
	        } else if ($h <= 0xEF) { // 3-byte sequence
	            $h = ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6 | (ord($c[2]) & 0x3F);
	            return "&#" . $h . ";";
	        } else if ($h <= 0xF4) { // 4-byte sequence
	            $h = ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12 | (ord($c[2]) & 0x3F) << 6 | (ord($c[3]) & 0x3F);
	            return "&#" . $h . ";";
	        }
	        return $c; // lead bytes above 0xF4 are invalid, leave as-is
	    }
	}
   
	$message_body=trim(previous_request_value('message_body')); // retrieves the user input from the textarea and trims spaces.
	$message_body=nl2br($message_body); // inserts an html line break (<br />) before each newline. Not important for anything other than the way I create a html email to send via Pear Mail.
	$oUnicodeReplace = new unicode_replace_entities();
	$message_body = $oUnicodeReplace->UTF8entities($message_body); // calls the function to convert to entity numbers.


	$message_body=str_replace('<br />', '<br>', $message_body); // changes all the <br /> created by nl2br to <br>, not important for anything other than the way i create my html emails.
	$message_body=str_replace('<br>', '<br>', $message_body); // same as above comment
	$message_body=str_replace('<br />', '<br>', $message_body); // same as above comment
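
For anyone who has the mbstring extension available, the same conversion can probably be done with the built-in mb_encode_numericentity() instead of a hand-rolled class. This is only a sketch under the assumption that the input is UTF-8. The helper name utf8_to_numeric_entities and the sample text are hypothetical and are not part of the code above.

```php
<?php
// Hypothetical helper name; converts every non-ASCII character to a decimal entity.
function utf8_to_numeric_entities($text) {
    // Conversion map: code points 0x80..0x10FFFF, no offset, full mask.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($text, $map, 'UTF-8');
}

// Same steps as the snippet above, using made-up sample input.
$message_body = nl2br("‘Hello”\nline two");
$message_body = utf8_to_numeric_entities($message_body);
$message_body = str_replace('<br />', '<br>', $message_body);

echo $message_body, "\n";
```

ASCII characters, including the <br> tags, pass through untouched. Only characters from 0x80 upwards are rewritten.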

Thinking about the way I use the function above to build an HTML email with PEAR Mail: should I be sanitising the input further, or doing something to prevent XSS? I'm not overly clued up on sanitisation and I've looked about, but there are mixed messages. Some people say you have to escape user-inputted strings, some say you don't. Any comments would be appreciated!
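
On the XSS question, one common approach is to escape the raw text with htmlspecialchars() before adding any markup of your own, so that anything the user typed is shown as text rather than treated as HTML. The sketch below assumes UTF-8 input and an HTML email body as the output. The function name prepare_for_html_email is hypothetical, and the sketch does not cover other concerns such as email header injection.

```php
<?php
// Hypothetical helper; assumes UTF-8 input destined for an HTML email body.
function prepare_for_html_email($raw) {
    // 1. Escape <, >, & and both quote types so user input cannot become markup.
    $text = htmlspecialchars(trim($raw), ENT_QUOTES, 'UTF-8');
    // 2. Turn remaining non-ASCII characters into numeric entities.
    $text = mb_encode_numericentity($text, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');
    // 3. Only now add our own line-break tags, in the <br> form the email uses.
    return str_replace('<br />', '<br>', nl2br($text));
}

$safe = prepare_for_html_email("<script>alert(1)</script>\n‘hi”");
echo $safe, "\n";
```

The order matters: escaping after nl2br() would also escape the <br> tags.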
