michaellunsford Posted June 30, 2011
Having a big problem right now with the right-side apostrophe. You know, that little ’ character: also known as &rsquo; or &#8217;
I don't know what the deal is, but when it's coming out of a MySQL database, PHP just doesn't display it properly (I get a question mark - and the W3C validator whines about UTF-8 compliance). I've checked one of the problem records in phpMyAdmin but it's displaying correctly!!! So, there's something amiss in mysql_fetch_assoc() or something. (btw, if you care, the MySQL field collation is utf8_general_ci)
So, I'm thinking I could just convert it to an entity before I send it to MySQL, and all will be right in the world. Enter experiment: htmlentities() will convert it, right?
<?php echo htmlentities("’"); ?>
Well, not so much. It returns "â??" -- that's actually creating three characters from one. And what's with those trailing question marks? Well, maybe htmlentities' sister will treat it better.
<?php echo htmlspecialchars("’"); ?>
Nope, she returns the unmodified ’ character -- which is exactly what I sent. That's no good. Sooo, I tried making my own:
<?php
$str = "Hello there, it’s all you!";
echo preg_replace_callback(
    "/[\x80-\xFF]/",
    function ($matches) { return "&#" . ord($matches[0]) . ";"; },
    $str
);
?>
Guess what? It creates this mess:
Hello there, it&#226;&#128;&#153;s all you!
That's three characters for the price of one - I bet this is what htmlentities() was doing. Okay, I'm officially stumped. Short of using str_replace("’","'",$variable) -- what's a body to do?
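A note on the "three characters for the price of one" result: in UTF-8 the right single quotation mark (U+2019) is stored as the three bytes E2 80 99, and a pattern such as /[\x80-\xFF]/ without the u modifier matches each of those bytes on its own. The sketch below is not from the thread. It assumes PHP 7.2 or later with the mbstring extension (for mb_ord()), and the helper name toNumericEntities is made up for illustration.

```php
<?php
// U+2019 written as explicit UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$str = "Hello there, it\xE2\x80\x99s all you!";

// Byte-wise matching (no "u" modifier): each of the three bytes of the
// apostrophe is matched separately, reproducing the output in the post.
$bytewise = preg_replace_callback(
    '/[\x80-\xFF]/',
    function ($m) { return '&#' . ord($m[0]) . ';'; },
    $str
);

// Hypothetical helper: with the "u" modifier the pattern matches whole
// code points, and mb_ord() returns the code point number.
function toNumericEntities(string $s): string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) { return '&#' . mb_ord($m[0], 'UTF-8') . ';'; },
        $s
    );
}

echo bin2hex("\xE2\x80\x99"), "\n";   // e28099
echo $bytewise, "\n";                 // Hello there, it&#226;&#128;&#153;s all you!
echo toNumericEntities($str), "\n";   // Hello there, it&#8217;s all you!
```

The only real difference between the two versions is whether the regex engine is told the subject is UTF-8; everything else follows from that.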
MadTechie Posted June 30, 2011
Is the HTML being output as UTF-8? i.e. at the VERY top add header('Content-type: text/html; charset=utf-8');
michaellunsford Posted June 30, 2011 (Author)
Yup, sure is:
HTTP/1.1 200 OK
Date: Thu, 30 Jun 2011 22:03:56 GMT
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
xyph Posted June 30, 2011
echo htmlentities("’", ENT_QUOTES, 'UTF-8');
Works fine for me. UTF-8 shouldn't need to be run through htmlentities() though. Using htmlspecialchars() should work fine.
On further testing, htmlspecialchars() doesn't seem to need the charset defined, as you should never run into a multibyte character that could be mistaken for <, >, ", etc. I'd add the charset just in case, but further testing might show it's not needed.
Here we go, from the manual:
Kenneth Kin Lum 09-Oct-2008 01:45
If your goal is just to protect your page from Cross Site Scripting (XSS) attacks, or just to show HTML tags on a web page (showing <body> on the page, for example), then using htmlspecialchars() is good enough and better than using htmlentities(). A minor point is that htmlspecialchars() is faster than htmlentities(). A more important point is that when we use htmlspecialchars($s) in our code, it is automatically compatible with UTF-8 strings. Otherwise, if we use htmlentities($s), and there happen to be foreign characters in the string $s in UTF-8 encoding, then htmlentities() is going to mess it up, as it converts the bytes 0x80 to 0xFF in the string to entities like &eacute; (unless you specifically provide a second argument and a third argument to htmlentities(), with the third argument being "UTF-8").
The reason htmlspecialchars($s) already works with UTF-8 strings is that it changes bytes that are in the range 0x00 to 0x7F to &lt; etc., while leaving bytes in the range 0x80 to 0xFF unchanged. We may wonder whether htmlspecialchars() may accidentally change any byte in a 2 to 4 byte UTF-8 character to &lt; etc. The answer is, it won't. When a UTF-8 character is 2 to 4 bytes long, all the bytes in this character are in the 0x80 to 0xFF range. None can be in the 0x00 to 0x7F range. When a UTF-8 character is 1 byte long, it is just the same as ASCII, which is 7 bit, from 0x00 to 0x7F.
As a result, when a UTF-8 character is 1 byte long, htmlspecialchars($s) will do its job, and when the UTF-8 character is 2 to 4 bytes long, htmlspecialchars($s) will just pass those bytes unchanged. So htmlspecialchars($s) will do the same job no matter whether $s is in ASCII, ISO-8859-1 (Latin-1), or UTF-8.
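To make the difference between the two functions concrete, here is a small sketch (not from the thread) that runs the same UTF-8 string through both with the charset named explicitly. The variable names are illustrative. Note that from PHP 5.4 onward both functions default to UTF-8, so the explicit third argument mattered most on the PHP versions current when this thread was written.

```php
<?php
// U+2019 as explicit UTF-8 bytes (E2 80 99), independent of file encoding.
$s = "<b>it\xE2\x80\x99s</b>";

// Naming the charset makes htmlentities() treat the three bytes as one
// character and emit the named entity for it.
$entities = htmlentities($s, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only rewrites &, <, >, " and (with ENT_QUOTES) ';
// the multibyte apostrophe passes through as the same three bytes.
$special = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // &lt;b&gt;it&rsquo;s&lt;/b&gt;
echo $special, "\n";  // &lt;b&gt;it’s&lt;/b&gt; (apostrophe still raw UTF-8)
```

If the page is already served as UTF-8, the second form is enough for safe output; the first is only needed when the result has to be pure ASCII.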
michaellunsford Posted June 30, 2011 (Author)
Well, interesting, the 'utf-8' parameter in htmlentities() did it. I'm making a permanent note of this. Thanks for the tip, xyph!
xyph Posted June 30, 2011
Use htmlspecialchars() unless for some reason you actually NEED every possible entity converted with htmlentities(). It's faster, you don't have to worry about a broken UTF-8 string nulling your output, and I just don't trust PHP's string functions when dealing with UTF-8.
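The "broken UTF-8 string nulling your output" remark can be reproduced with a short sketch (not from the thread). It assumes PHP 5.4 or later, where the ENT_SUBSTITUTE flag is available; the sample string is made up. As far as I can tell from the manual, htmlspecialchars() follows the same rule once it is told the input is UTF-8, so on current PHP versions the flag is worth considering for either function.

```php
<?php
// "caf\xE9" is Latin-1 for "café"; the lone byte E9 is not valid UTF-8.
$broken = "caf\xE9 au lait";

// Without ENT_SUBSTITUTE or ENT_IGNORE, an invalid sequence makes the
// whole result an empty string.
$strict = htmlentities($broken, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE replaces the invalid byte with U+FFFD instead of
// discarding the entire string.
$lenient = htmlentities($broken, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');            // bool(true)
var_dump(strpos($lenient, 'caf'));   // int(0)
```

Substituting rather than dropping keeps the rest of the text visible, which usually makes a bad record easier to spot than a silently empty field.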