
Converting to html entities.


Bottyz


Hi all,

 

Just a quick query really. We have a contact script on our website which has a message box for the enquiry. Once a user submits their message, it is checked for malicious code, then converted to HTML for use in a PEAR Mail HTML email. Everything works great except for two symbols which never get converted correctly: they always turn into ?? rather than the symbols themselves.

 

The symbols in question are marked in the code below:

 

	function cndstrips($str) {
		if (get_magic_quotes_gpc()) {
			return htmlentities(utf8_decode(html_entity_decode(nl2br(stripslashes($str)))));
		} else {
			return htmlentities(utf8_decode(html_entity_decode(nl2br($str))));
		}
	}

	$message_body=trim(previous_request_value('message_body'));
	$message_body=str_replace('-', '-', $message_body);
	$message_body=str_replace('‘', '&#8216;', $message_body); // This Symbol
	$message_body=str_replace('”', '&#8221;', $message_body); // This Symbol too!
	$message_body=cndstrips($message_body);
	$message_body=str_replace('&lt;br /&gt;', '<br>', $message_body);
	$message_body=str_replace('&lt;br&gt;', '<br>', $message_body);
	$message_body=str_replace('<br />', '<br>', $message_body);

 

I have attempted to convert them to their relevant HTML entity numbers (with str_replace), but this doesn't work. Is there an easier way to do the above? I'm not the best with sanitising code, as you may see!

 

Thanks in advance.


 

I had a similar issue with htmlentities the other day, and xyph gave me some good pointers here.

 

The doctype and the content-type must both be correct, and htmlentities also has an optional parameter (the third one, after the flags), which I never knew about, that determines its character set.
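
To illustrate that charset parameter, here is a minimal sketch. The sample string and variable names are made up for illustration, and it assumes the form data arrives as UTF-8. With 'UTF-8' passed as the third argument, htmlentities should recognise the curly quotes and emit named entities for them.

```php
<?php
// Hypothetical sample input: a left single quote (U+2018) and a
// right double quote (U+201D), written with PHP 7+ Unicode escapes.
$message = "\u{2018}quoted\u{201D}";

// Second argument: flags. Third argument: the encoding of the input.
$escaped = htmlentities($message, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```

The email itself would still need to declare the same charset in its content-type for the result to display consistently.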

 

Hi Freelance,

 

Thanks for the pointer... I think it was something to do with the UTF-8 encoding. I searched for a while on the php.net site and found a very good function for converting all special characters to their &#number; entities. It works great for the script. I've included it below in case anyone else would like to use it:

 


//$message_body variable is the content from the textarea on the contact form. It can contain any character a user can input.

	class unicode_replace_entities {
	    // Replaces every non-ASCII character in a UTF-8 string with its &#number; entity.
	    public function UTF8entities($content="") {
	        $contents = $this->unicode_string_to_array($content);
	        $swap = "";
	        $iCount = count($contents);
	        for ($o=0;$o<$iCount;$o++) {
	            $swap .= $this->unicode_entity_replace($contents[$o]);
	        }
	        return mb_convert_encoding($swap,"UTF-8"); //not really necessary, but why not.
	    }

	    // Splits a UTF-8 string into an array of single characters.
	    public function unicode_string_to_array( $string ) { //adjwilli
	        $array = array(); // initialised so an empty string gives an empty array
	        $strlen = mb_strlen( $string, "UTF-8" );
	        while ($strlen) {
	            $array[] = mb_substr( $string, 0, 1, "UTF-8" );
	            $string = mb_substr( $string, 1, $strlen, "UTF-8" );
	            $strlen = mb_strlen( $string, "UTF-8" );
	        }
	        return $array;
	    }

	    // Decodes one UTF-8 character to its code point and returns it as &#number;.
	    public function unicode_entity_replace($c) { //m. perez
	        $h = ord($c[0]); // square brackets: the $c{0} form no longer works in PHP 8
	        if ($h < 0xC2) {
	            return $c; // ASCII, or a byte that cannot start a multi-byte sequence
	        }

	        if ($h <= 0xDF) { // two-byte sequence
	            $h = ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
	        } else if ($h <= 0xEF) { // three-byte sequence
	            $h = ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6 | (ord($c[2]) & 0x3F);
	        } else if ($h <= 0xF4) { // four-byte sequence
	            $h = ($h & 0x0F) << 18 | (ord($c[1]) & 0x3F) << 12 | (ord($c[2]) & 0x3F) << 6 | (ord($c[3]) & 0x3F);
	        } else {
	            return $c; // not a valid UTF-8 lead byte, leave untouched
	        }
	        return "&#" . $h . ";";
	    }
	}
   
	$message_body=trim(previous_request_value('message_body')); // retrieves the user input from the textarea and trims surrounding whitespace.
	$message_body=nl2br($message_body); // inserts an HTML line break (<br />) before each newline. Only matters because of the way I build the HTML email sent via PEAR Mail.
	$oUnicodeReplace = new unicode_replace_entities();
	$message_body = $oUnicodeReplace->UTF8entities($message_body); // calls the function to convert non-ASCII characters to entity numbers.


	$message_body=str_replace('<br />', '<br>', $message_body); // changes the <br /> created by nl2br to <br>; again, only matters because of the way I build my HTML emails.
	$message_body=str_replace('<br>', '<br>', $message_body); // same as above comment
	$message_body=str_replace('<br />', '<br>', $message_body); // same as above comment
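
For comparison, here is a shorter sketch of the same conversion using mbstring's built-in mb_encode_numericentity() instead of a hand-written class. The helper name to_numeric_entities and the sample text are hypothetical, and it assumes the input is valid UTF-8 and the mbstring extension is loaded (the code above already relies on it).

```php
<?php
// Hypothetical helper: convert every non-ASCII character of a UTF-8
// string to a decimal &#number; entity, leaving ASCII (including any
// <br /> tags) untouched.
function to_numeric_entities($text) {
    // Conversion map: first code point, last code point, offset, mask.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

// Sample input with a left single quote (U+2018) and a right double quote (U+201D).
$sample = nl2br("\u{2018}Hi\u{201D}\nBye");
echo to_numeric_entities($sample), "\n";
```

Like the class above, this only rewrites non-ASCII characters, so it is an encoding step and not a sanitisation step.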

 

Thinking about the way I use the function above to build an HTML email with PEAR Mail: should I be sanitising the input further, or doing something to prevent XSS? I'm not overly clued up on sanitisation. I've looked around, but there are mixed messages: some people say you have to escape user-supplied strings, some say you don't. Any comments would be appreciated!
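
One commonly suggested approach, sketched here under assumptions and not as a definitive answer: escape the HTML-special characters with htmlspecialchars() before adding any markup of your own, so that tags typed by the user reach the email as plain text. The helper name prepare_message_html is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: turn raw textarea input into an HTML fragment
// that is safe to drop into an HTML email body.
function prepare_message_html($raw) {
    // 1. Escape <, >, &, and both quote styles in the user's text.
    $safe = htmlspecialchars(trim($raw), ENT_QUOTES, 'UTF-8');
    // 2. Only now add our own markup; false asks for <br> instead of <br />.
    $safe = nl2br($safe, false);
    // 3. Convert remaining non-ASCII characters to &#number; entities.
    return mb_encode_numericentity($safe, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');
}

echo prepare_message_html("<script>alert(1)</script>\n\u{2018}ok\u{201D}"), "\n";
```

The order matters in this sketch: escaping after nl2br would also escape the line breaks, and skipping the escape would let any tag the user types through unchanged.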

 

 
