problem creating entities for some characters

michaellunsford · June 30, 2011

Having a big problem right now with the right-side apostrophe. You know, that little ’ character, also known as &rsquo; or &#8217;. I don't know what the deal is, but when it's coming out of a MySQL database, PHP just doesn't display it properly (I get a question mark, and the W3C validator whines about UTF-8 compliance). I've checked one of the problem records in phpMyAdmin, and it displays correctly there! So something is amiss in mysql_fetch_assoc() or something. (By the way, if you care, the MySQL field collation is utf8_general_ci.)

So, I'm thinking I could just convert it to an entity before I send it to MySQL, and all will be right in the world. Enter experiment:

htmlentities will convert it, right?

<?php echo htmlentities("’"); ?>

Well, not so much. It returns "â??" -- that's actually three characters created from one. And what's with those trailing question marks? Well, maybe htmlentities()'s sister will treat it better.

<?php echo htmlspecialchars("’"); ?>

Nope, she returns the unmodified ’ character -- which is exactly what I sent. That's no good.

Sooo, I tried making my own:

<?php
$str="Hello there, it’s all you!";
// Anonymous function instead of create_function(), which was removed in PHP 8.
echo preg_replace_callback("/[\x80-\xFF]/", function ($matches) { return "&#" . ord($matches[0]) . ";"; }, $str);
?>

Guess what? It creates this mess: Hello there, itâ€™s all you! And this is what ends up in the page source:

Hello there, it&#226;&#128;&#153;s all you!

That's three characters for the price of one - I bet this is what htmlentities() was doing.
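For anyone following along, here is a minimal sketch of where the three characters come from. In UTF-8 the right single quotation mark (U+2019) is stored as three bytes, and ord() only ever reads one byte at a time. The variable names are purely illustrative, and the character is written with byte escapes so the result does not depend on how the script file is saved.

```php
<?php
// U+2019 (right single quotation mark) written as its UTF-8 bytes,
// so the result does not depend on the encoding of this file.
$quote = "\xE2\x80\x99";

$byteCount = strlen($quote); // strlen() counts bytes, not characters
$hex = bin2hex($quote);

// ord() reads a single byte, so each byte yields its own number.
$ords = array_map('ord', str_split($quote));

echo $byteCount, "\n";          // 3
echo $hex, "\n";                // e28099
echo implode(' ', $ords), "\n"; // 226 128 153
```

Those three numbers are the same ones that show up in the &#226;&#128;&#153; output above.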

Okay, I'm officially stumped. Short of using str_replace("’","'",$variable) -- what's a body to do?
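Short of str_replace(), one possible way to rescue the hand-rolled approach is to match whole UTF-8 characters rather than single bytes, then emit the character's code point. The sketch below assumes the input is well-formed UTF-8; utf8_code_point() is a made-up helper for illustration, not a built-in PHP function.

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character to its code point.
function utf8_code_point($char)
{
    $bytes = array_values(unpack('C*', $char));
    $lead = $bytes[0];
    if ($lead < 0x80) {
        return $lead;              // plain ASCII
    } elseif ($lead < 0xE0) {
        $codePoint = $lead & 0x1F; // 2-byte sequence
    } elseif ($lead < 0xF0) {
        $codePoint = $lead & 0x0F; // 3-byte sequence
    } else {
        $codePoint = $lead & 0x07; // 4-byte sequence
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < count($bytes); $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

$str = "Hello there, it\xE2\x80\x99s all you!";

// The /u modifier makes PCRE match whole UTF-8 characters instead of bytes.
$encoded = preg_replace_callback(
    '/[^\x00-\x7F]/u',
    function ($matches) {
        return '&#' . utf8_code_point($matches[0]) . ';';
    },
    $str
);

echo $encoded, "\n"; // Hello there, it&#8217;s all you!
```

The only real change from the experiment above is the /u modifier plus decoding the full sequence, so the callback receives one three-byte character instead of three separate bytes.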

MadTechie · June 30, 2011

Is the HTML being output as UTF-8?

i.e., at the VERY top add

header ('Content-type: text/html; charset=utf-8');

michaellunsford · June 30, 2011

Yup, sure is:

HTTP/1.1 200 OK
Date: Thu, 30 Jun 2011 22:03:56 GMT
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8

xyph · June 30, 2011

echo htmlentities("’", ENT_QUOTES, 'UTF-8');

Works fine for me.
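A rough sketch of why the third argument matters. Before PHP 5.4, htmlentities() assumed ISO-8859-1 when no charset was given, which is presumably what produced the "â??" earlier in the thread. Passing 'ISO-8859-1' explicitly below is only a way to reproduce that old default on a current PHP version; the variable names are illustrative.

```php
<?php
$quote = "\xE2\x80\x99"; // U+2019 as UTF-8 bytes

// Told the input is UTF-8, htmlentities() sees one character.
$asUtf8 = htmlentities($quote, ENT_QUOTES, 'UTF-8');

// Told the input is ISO-8859-1, it sees three one-byte characters,
// and the first byte (0xE2) is the Latin-1 letter "a with circumflex".
$asLatin1 = htmlentities($quote, ENT_QUOTES, 'ISO-8859-1');

echo $asUtf8, "\n";            // &rsquo;
echo bin2hex($asLatin1), "\n"; // hex dump, because the trailing bytes are not printable
```

With the Latin-1 reading, the result begins with &acirc; and the two remaining bytes have no printable meaning, which would explain the trailing question marks.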

UTF-8 shouldn't need to be run through htmlentities() though. Using htmlspecialchars() should work fine.

On further testing, htmlspecialchars() doesn't seem to need the charset defined, as you should never run into a multibyte character that could be mistaken for <, >, ", etc. I'd add the charset just in case, but further testing might show it's not needed.

Here we go, from the manual

Kenneth Kin Lum 09-Oct-2008 01:45
if your goal is just to protect your page from Cross Site Scripting (XSS) attack, or just to show HTML tags on a web page (showing <body> on the page, for example), then using htmlspecialchars() is good enough and better than using htmlentities(). A minor point is htmlspecialchars() is faster than htmlentities(). A more important point is, when we use htmlspecialchars($s) in our code, it is automatically compatible with UTF-8 string. Otherwise, if we use htmlentities($s), and there happens to be foreign characters in the string $s in UTF-8 encoding, then htmlentities() is going to mess it up, as it modifies the byte 0x80 to 0xFF in the string to entities like &eacute;. (unless you specifically provide a second argument and a third argument to htmlentities(), with the third argument being "UTF-8").

The reason htmlspecialchars($s) already works with UTF-8 string is that, it changes bytes that are in the range 0x00 to 0x7F to &lt; etc, while leaving bytes in the range 0x80 to 0xFF unchanged. We may wonder whether htmlspecialchars() may accidentally change any byte in a 2 to 4 byte UTF-8 character to &lt; etc. The answer is, it won't. When a UTF-8 character is 2 to 4 bytes long, all the bytes in this character is in the 0x80 to 0xFF range. None can be in the 0x00 to 0x7F range. When a UTF-8 character is 1 byte long, it is just the same as ASCII, which is 7 bit, from 0x00 to 0x7F. As a result, when a UTF-8 character is 1 byte long, htmlspecialchars($s) will do its job, and when the UTF-8 character is 2 to 4 bytes long, htmlspecialchars($s) will just pass those bytes unchanged. So htmlspecialchars($s) will do the same job no matter whether $s is in ASCII, ISO-8859-1 (Latin-1), or UTF-8.
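The byte-range argument in that comment can be checked with a small sketch. The sample characters (é, ’ and U+1F600) are arbitrary picks of 2, 3 and 4 bytes, not an exhaustive proof, and the variable names are illustrative.

```php
<?php
// A few sample characters of 2, 3 and 4 bytes: U+00E9, U+2019 and U+1F600.
$samples = array("\xC3\xA9", "\xE2\x80\x99", "\xF0\x9F\x98\x80");

$allHighBytes = true;
foreach ($samples as $char) {
    foreach (str_split($char) as $byte) {
        if (ord($byte) < 0x80) {
            $allHighBytes = false; // would collide with the ASCII range
        }
    }
}

// Only the ASCII markup characters are escaped; the quote's bytes pass through.
$escaped = htmlspecialchars("<b>it\xE2\x80\x99s</b>", ENT_QUOTES, 'UTF-8');

var_dump($allHighBytes);    // bool(true)
echo bin2hex($escaped), "\n"; // hex dump shows e28099 still intact in the middle
```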

michaellunsford · June 30, 2011

well, interesting, the 'utf-8' parameter in htmlentities() did it. I'm making a permanent note of this. Thanks for the tip xyph!

xyph · June 30, 2011

Use htmlspecialchars() unless for some reason you actually NEED every possible entity converted with htmlentities().

It's faster, you don't have to worry about a broken UTF-8 string nulling your output, and I just don't trust PHP's string functions when dealing with UTF-8.
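One caveat that may be worth checking on your own PHP version: as far as the manual describes it, when the charset is given as UTF-8, htmlspecialchars() also returns an empty string for an invalid byte sequence unless ENT_SUBSTITUTE or ENT_IGNORE is set. A minimal sketch, assuming PHP 5.4 or later (where ENT_SUBSTITUTE exists) and using a deliberately truncated sequence as the broken input:

```php
<?php
// "abc" followed by a truncated 3-byte sequence (the last byte is missing).
$broken = "abc\xE2\x80";

// Flags are passed explicitly because the default flags differ between PHP versions.
$strict = htmlspecialchars($broken, ENT_QUOTES, 'UTF-8');
$lenient = htmlspecialchars($broken, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');          // bool(true): the whole string is dropped
var_dump(strpos($lenient, 'abc')); // int(0): the valid text survives
```

With ENT_SUBSTITUTE the bad bytes are replaced by U+FFFD instead of wiping out the output, which may be a safer default when the data source is not fully trusted.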
