
problem creating entities for some characters


michaellunsford


Having a big problem right now with the right-side apostrophe. You know, that little ’ character, also known as &rsquo; or &#8217;. I don't know what the deal is, but when it's coming out of a MySQL database, PHP just doesn't display it properly (I get a question mark, and the W3C validator whines about UTF-8 compliance). I've checked one of the problem records in phpMyAdmin and it's displaying correctly there!!! So, there's something amiss in mysql_fetch_assoc() or something. (btw, if you care, the MySQL field collation is utf8_general_ci)

 

So, I'm thinking I could just convert it to an entity before I send it to MySQL, and all will be right in the world. Enter experiment:

 

htmlentities will convert it, right?

<?php echo htmlentities("’"); ?>

Well, not so much. It returns "â??" -- that's actually creating three characters from one. And what's with those trailing question marks? Well, maybe htmlentities' sister will treat it better.

<?php echo htmlspecialchars("’"); ?>

Nope, she returns the unmodified ’ character -- which is exactly what I sent. That's no good.

 

Sooo, I tried making my own:

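<?php
$str="Hello there, it’s all you!";
// Replace each high byte (0x80-0xFF) with a numeric entity for that single byte.
// create_function() was removed in PHP 8, so this uses an anonymous function.
echo preg_replace_callback("/[\x80-\xFF]/", function ($matches) { return "&#".ord($matches[0]).";"; }, $str);
?>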

Guess what? It creates this mess:

Hello there, it&#226;&#128;&#153;s all you!

That's three characters for the price of one - I bet this is what htmlentities() was doing.

 

Okay, I'm officially stumped. Short of using str_replace("’","'",$variable) -- what's a body to do?
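
For what it's worth, the three-for-one result is expected: ’ (U+2019) is stored as three bytes in UTF-8 (0xE2 0x80 0x99), and a byte-wise pattern turns each byte into its own entity. Below is a minimal sketch of a character-wise version, assuming the input is valid UTF-8. The helper name utf8_to_numeric_entities() is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): turn each non-ASCII character of a
// UTF-8 string into ONE decimal numeric entity, instead of one per byte.
// Returns null if $str is not valid UTF-8 (preg_replace_callback fails with /u).
function utf8_to_numeric_entities(string $str): ?string
{
    // The /u modifier makes the pattern match whole UTF-8 characters.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Payload bits of the lead byte: 5 bits (2-byte), 4 (3-byte), 3 (4-byte).
            $cp = $bytes[0] & (0xFF >> ($n + 1));
            // Each continuation byte contributes its low 6 bits.
            for ($i = 1; $i < $n; $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $str
    );
}

// "\xE2\x80\x99" is the UTF-8 encoding of the right single quotation mark.
echo utf8_to_numeric_entities("Hello there, it\xE2\x80\x99s all you!"), "\n";
// prints: Hello there, it&#8217;s all you!
```

This only shows where the three numbers come from. It does not explain the question mark coming back from the database.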


echo htmlentities("’", ENT_QUOTES, 'UTF-8');

 

Works fine for me.

 

UTF-8 shouldn't need to be run through htmlentities() though. Using htmlspecialchars() should work fine.
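
To make the difference concrete, here is a small sketch comparing the calls discussed so far. It assumes the source string is UTF-8, and the variable names are only illustrative. Passing 'ISO-8859-1' explicitly mimics the default charset that older PHP versions (before 5.4) used when none was given, which is likely what produced the "â??" output in the first post.

```php
<?php
// "\xE2\x80\x99" is the right single quotation mark (U+2019) encoded as UTF-8.
$quote = "\xE2\x80\x99";

// Told the input is UTF-8, htmlentities() emits one named entity.
$asUtf8 = htmlentities($quote, ENT_QUOTES, 'UTF-8');

// Told the input is ISO-8859-1, it treats each byte as its own character:
// 0xE2 becomes &acirc; and the other two bytes have no named entity.
$asLatin1 = htmlentities($quote, ENT_QUOTES, 'ISO-8859-1');

// htmlspecialchars() only escapes &, <, >, " and ', so the quote passes through.
$special = htmlspecialchars($quote, ENT_QUOTES, 'UTF-8');

echo $asUtf8, "\n";                               // &rsquo;
echo bin2hex($asLatin1), "\n";                    // hex dump, the raw bytes are not printable
echo var_export($special === $quote, true), "\n"; // true
```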

 

 

On further testing, htmlspecialchars() doesn't seem to need the charset defined, as you should never run into a multibyte character that could be mistaken for <, >, ", etc. I'd add the charset just in case, but further testing might show it's not needed.

 

 

Here we go, from the manual

Kenneth Kin Lum 09-Oct-2008 01:45

if your goal is just to protect your page from Cross Site Scripting (XSS) attacks, or just to show HTML tags on a web page (showing <body> on the page, for example), then using htmlspecialchars() is good enough and better than using htmlentities(). A minor point is htmlspecialchars() is faster than htmlentities(). A more important point is, when we use htmlspecialchars($s) in our code, it is automatically compatible with UTF-8 strings. Otherwise, if we use htmlentities($s), and there happen to be foreign characters in the string $s in UTF-8 encoding, then htmlentities() is going to mess it up, as it modifies the bytes 0x80 to 0xFF in the string to entities like &eacute;. (unless you specifically provide a second argument and a third argument to htmlentities(), with the third argument being "UTF-8").

 

The reason htmlspecialchars($s) already works with UTF-8 strings is that it changes bytes that are in the range 0x00 to 0x7F to &lt; etc, while leaving bytes in the range 0x80 to 0xFF unchanged. We may wonder whether htmlspecialchars() may accidentally change any byte in a 2 to 4 byte UTF-8 character to &lt; etc. The answer is, it won't. When a UTF-8 character is 2 to 4 bytes long, all the bytes in this character are in the 0x80 to 0xFF range. None can be in the 0x00 to 0x7F range. When a UTF-8 character is 1 byte long, it is just the same as ASCII, which is 7 bit, from 0x00 to 0x7F. As a result, when a UTF-8 character is 1 byte long, htmlspecialchars($s) will do its job, and when the UTF-8 character is 2 to 4 bytes long, htmlspecialchars($s) will just pass those bytes unchanged. So htmlspecialchars($s) will do the same job no matter whether $s is in ASCII, ISO-8859-1 (Latin-1), or UTF-8.
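
The byte-range argument in the quoted note can be checked directly. The sketch below is only an illustration with made-up sample strings: it confirms that every byte of a few multibyte UTF-8 characters is 0x80 or higher, and that htmlspecialchars() escapes the ASCII markup while leaving those bytes alone.

```php
<?php
// Sample multibyte characters written as explicit UTF-8 byte sequences:
// é (2 bytes), ’ (3 bytes), and U+1F600 (4 bytes).
$samples = ["\xC3\xA9", "\xE2\x80\x99", "\xF0\x9F\x98\x80"];

$allHigh = true;
foreach ($samples as $char) {
    foreach (unpack('C*', $char) as $byte) {
        // No byte of a multibyte character falls in the ASCII range.
        if ($byte < 0x80) {
            $allHigh = false;
        }
    }
}
var_dump($allHigh); // bool(true)

// ASCII specials are escaped; the multibyte bytes are passed through untouched.
$escaped = htmlspecialchars("<b>caf\xC3\xA9</b>", ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // &lt;b&gt;café&lt;/b&gt;
```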


Use htmlspecialchars() unless for some reason you actually NEED every possible entity converted with htmlentities().

 

It's faster, you don't have to worry about a broken UTF-8 string nulling your output, and I just don't trust PHP's string functions when dealing with UTF-8.
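
The "nulling your output" remark most likely refers to how these functions handle malformed input: with a UTF-8 charset, they return an empty string if the input contains an invalid byte sequence, unless ENT_IGNORE or ENT_SUBSTITUTE is set. A minimal sketch, assuming PHP 5.4 or later (where ENT_SUBSTITUTE is available) and a deliberately broken sample string:

```php
<?php
// "\xFF" can never appear in valid UTF-8, so this string is malformed on purpose.
$broken = "a\xFFb";

// Without ENT_IGNORE or ENT_SUBSTITUTE the whole result is an empty string.
$strict = htmlspecialchars($broken, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE swaps the bad byte for U+FFFD (bytes EF BF BD) instead.
$lenient = htmlspecialchars($broken, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// An empty pattern with /u is a cheap validity check: preg_match() returns
// false (not 0) when the subject is not valid UTF-8.
$isValid = preg_match('//u', $broken) === 1;

var_dump($strict);            // string(0) ""
echo bin2hex($lenient), "\n"; // 61efbfbd62
var_dump($isValid);           // bool(false)
```

Whether to substitute or to reject the input outright is a design choice. Substituting keeps the page rendering, while rejecting surfaces the bad data earlier.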
