problem creating entities for some characters


michaellunsford

I'm having a big problem right now with the right-side apostrophe. You know, that little ’ character, also known as &rsquo; or &#8217;. I don't know what the deal is, but when it comes out of a MySQL database, PHP just doesn't display it properly (I get a question mark, and the W3C validator whines about UTF-8 compliance). I've checked one of the problem records in phpMyAdmin and it displays correctly there! So something is amiss in mysql_fetch_assoc() or thereabouts. (BTW, if you care, the MySQL field collation is utf8_general_ci.)

 

So, I'm thinking I could just convert it to an entity before I send it to MySQL, and all will be right in the world. Enter experiment:

 

htmlentities will convert it, right?

<?php echo htmlentities("’"); ?>

Well, not so much. It returns "â??" -- that's actually three characters created from one. And what's with those trailing question marks? Well, maybe htmlentities()'s sister will treat it better.

<?php echo htmlspecialchars("’"); ?>

Nope, she returns the unmodified ’ character -- which is exactly what I sent. That's no good.

 

Sooo, I tried making my own:

<?php
$str="Hello there, it’s all you!";
// Turn every byte in the 0x80-0xFF range into a numeric entity, one byte at a time.
echo preg_replace_callback("/[\x80-\xFF]/", function ($matches) { return "&#".ord($matches[0]).";"; }, $str);
?>

guess what? It creates this mess: Hello there, it’s all you!

Hello there, it&#226;&#128;&#153;s all you!

That's three characters for the price of one - I bet this is what htmlentities() was doing.
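
For what it's worth, the three numbers above are just the three bytes UTF-8 uses to store that one character. Here is a minimal sketch of a code-point-aware version of the same replacement, assuming PHP 7.2+ with the mbstring extension available (for mb_ord()). The `\u{2019}` escape stands in for the pasted apostrophe, and the variable names are made up for illustration:

```php
<?php
// U+2019 (right single quotation mark), written as an escape so the
// example does not depend on how this file itself is encoded.
$apostrophe = "\u{2019}";

// In UTF-8 the character occupies three bytes: E2 80 99 (226, 128, 153).
$bytes = array_map('ord', str_split($apostrophe));

// With the /u modifier the pattern matches whole code points rather than
// single bytes, and mb_ord() returns the code point instead of a byte value.
$str = "Hello there, it{$apostrophe}s all you!";
$encoded = preg_replace_callback(
    '/[^\x00-\x7F]/u',
    function ($matches) {
        return '&#' . mb_ord($matches[0], 'UTF-8') . ';';
    },
    $str
);

echo implode(' ', $bytes), "\n"; // 226 128 153
echo $encoded, "\n";             // Hello there, it&#8217;s all you!
```

The only real changes from the byte-wise attempt are the /u modifier and mb_ord() in place of ord().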

 

Okay, I'm officially stumped. Short of using str_replace("’","'",$variable) -- what's a body to do?

Yup, sure is:

HTTP/1.1 200 OK
Date: Thu, 30 Jun 2011 22:03:56 GMT
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8

echo htmlentities("’", ENT_QUOTES, 'UTF-8');

 

Works fine for me.
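
The difference the third argument makes can be seen side by side. This is only a sketch: it assumes the old default charset of ISO-8859-1 (PHP before 5.4) is what produced the "â??" earlier in the thread, which is an inference rather than something confirmed here. The `\u{2019}` escape needs PHP 7+ and the variable names are hypothetical:

```php
<?php
$apostrophe = "\u{2019}"; // UTF-8 bytes E2 80 99

// Treated as ISO-8859-1, each byte is handled as its own character:
// 0xE2 is "â" and gets a named entity, the remaining two bytes do not.
$asLatin1 = htmlentities($apostrophe, ENT_QUOTES, 'ISO-8859-1');

// Treated as UTF-8, the three bytes are decoded as a single character.
$asUtf8 = htmlentities($apostrophe, ENT_QUOTES, 'UTF-8');

echo $asUtf8, "\n"; // &rsquo;
```

Passing the charset explicitly keeps the result the same regardless of which PHP version or default_charset setting is in play.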

 

UTF-8 shouldn't need to be run through htmlentities() though. Using htmlspecialchars() should work fine.

 

 

On further testing, htmlspecialchars() doesn't seem to need the charset defined, as you should never run into a multibyte character that could be mistaken for <, >, ", etc. I'd add the charset just in case, but further testing might show it's not needed.

 

 

Here we go, from the manual

Kenneth Kin Lum 09-Oct-2008 01:45

If your goal is just to protect your page from a Cross Site Scripting (XSS) attack, or just to show HTML tags on a web page (showing <body> on the page, for example), then using htmlspecialchars() is good enough and better than using htmlentities().  A minor point is that htmlspecialchars() is faster than htmlentities().  A more important point is that, when we use htmlspecialchars($s) in our code, it is automatically compatible with UTF-8 strings.  Otherwise, if we use htmlentities($s), and there happen to be foreign characters in the string $s in UTF-8 encoding, then htmlentities() is going to mess it up, as it converts the bytes 0x80 to 0xFF in the string to entities like &eacute; (unless you specifically provide a second and a third argument to htmlentities(), with the third argument being "UTF-8").

 

The reason htmlspecialchars($s) already works with UTF-8 strings is that it changes bytes in the range 0x00 to 0x7F to &lt; etc., while leaving bytes in the range 0x80 to 0xFF unchanged.  We may wonder whether htmlspecialchars() could accidentally change any byte in a 2 to 4 byte UTF-8 character to &lt; etc.  The answer is, it won't.  When a UTF-8 character is 2 to 4 bytes long, all the bytes in this character are in the 0x80 to 0xFF range. None can be in the 0x00 to 0x7F range.  When a UTF-8 character is 1 byte long, it is just the same as ASCII, which is 7 bit, from 0x00 to 0x7F.  As a result, when a UTF-8 character is 1 byte long, htmlspecialchars($s) will do its job, and when the UTF-8 character is 2 to 4 bytes long, htmlspecialchars($s) will just pass those bytes through unchanged.  So htmlspecialchars($s) will do the same job no matter whether $s is in ASCII, ISO-8859-1 (Latin-1), or UTF-8.
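
The byte-range argument in that comment is easy to check. Here is a rough sketch, assuming PHP 7+ for the `\u{2019}` escape; the variable names are only for illustration:

```php
<?php
$apostrophe = "\u{2019}"; // UTF-8 bytes E2 80 99

// Every byte of a multibyte UTF-8 character is 0x80 or above,
// so none of them can collide with <, >, &, " or '.
$allHigh = true;
foreach (str_split($apostrophe) as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}

// Only the ASCII markup characters are escaped; the apostrophe's
// bytes pass through unchanged.
$escaped = htmlspecialchars("<b>it{$apostrophe}s</b>", ENT_QUOTES, 'UTF-8');

var_dump($allHigh);  // bool(true)
echo $escaped, "\n"; // &lt;b&gt;it’s&lt;/b&gt;
```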

Use htmlspecialchars() unless for some reason you actually NEED every possible entity converted with htmlentities()

 

It's faster, you don't have to worry about a broken UTF-8 string nulling your output, and I just don't trust PHP's string functions when dealing with UTF-8.
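
To illustrate the "nulling" remark: when htmlentities() is told the input is UTF-8 and the input isn't valid UTF-8, it returns an empty string. The sketch below uses a deliberately truncated sequence. The ENT_SUBSTITUTE flag (PHP 5.4+) is not mentioned in this thread; it's shown only as one possible way around the problem:

```php
<?php
$broken = "bad byte \xE2"; // a UTF-8 lead byte with nothing after it

// Invalid input plus an explicit UTF-8 charset yields an empty string.
$nulled = htmlentities($broken, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE swaps the invalid sequence for a replacement character
// instead of discarding the whole string.
$kept = htmlentities($broken, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($nulled === '');                  // bool(true)
var_dump(strpos($kept, 'bad byte') === 0); // bool(true)
```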
