NotionCommotion Posted January 11, 2015

I recently read another post which stated:

"Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably."

That seems like a lot of work. How important is it really to do so? Can the DB be configured in some kind of strict mode that raises an error on anything that isn't valid UTF-8, so that I can deal with it as an exception?
QuickOldCar Posted January 11, 2015

The main issue is people declaring the wrong page encoding, or sending invalid characters or improperly encoded text. If you want your data clean, you should check it. I usually try to detect the encoding of the entire page or text and convert it once using iconv.
NotionCommotion Posted January 11, 2015

Thanks QuickOldCar,

I am just getting more into encoding, know nothing about iconv, and am a bit overwhelmed. A couple of basics...

Client sends a POST or GET request to the server. Should I check each element in the array to ensure it is utf-8? How would this be done using iconv?

My php.ini file states that "PHP's default character set is set to empty", and that refers to utf-8. But I thought that PHP was basically utf-8 unaware. What is the point?

When creating a PDO connection, I include charset=utf8. Do I also need to configure the MySQL database as utf8? How is this done?

When sending HTML to the browser, I include the <meta charset="utf-8" /> so the client interprets it correctly. Should I be doing something similar for JSON?
kicken Posted January 11, 2015

"Client sends POST or GET request to server. Should I check each element in the array to ensure it is utf-8? How would this be done using iconv?"

That would be one quick and easy thing you could do, but not necessarily the most flexible. Similar to the old register globals hack, just loop over the input arrays ($_COOKIE and $_FILES also) and check each value. If any value is an invalid UTF-8 sequence, you could just reject the request.

"My php.ini file states that "PHP's default character set is set to empty", and that refers to utf-8. But I thought that PHP was basically utf-8 unaware. What is the point?"

The default_charset parameter is used in a few functions (such as htmlentities). Overall, PHP is basically character set unaware and treats strings as plain sequences of bytes.

"When creating a PDO connection, I include charset=utf8. Do I also need to configure the MySQL database as utf8? How is this done?"

Yes. You can set the encoding for the entire table or on a per-column basis, e.g.:

    create table blah ( name varchar(255) ) DEFAULT CHARACTER SET=utf8

or

    create table blah ( name varchar(255) CHARACTER SET utf8 )

"When sending HTML to the browser, I include the <meta charset="utf-8" /> so the client interprets it correctly. Should I be doing something similar for JSON?"

You should set the character set in the Content-type header. This applies to both your HTML pages and the JSON data.

    header('Content-type: text/html; charset=utf-8');

or

    header('Content-type: application/json; charset=utf-8');
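For readers who want to see the "loop over the input arrays and reject" approach spelled out, here is a minimal sketch. It assumes the mbstring extension is loaded; the helper name allValidUtf8() and the plain 400 response are hypothetical choices for illustration, not something from the posts above.

```php
<?php
// Hypothetical helper: returns true only if every string key and string value
// in a (possibly nested) input array is valid UTF-8. Non-string values such as
// the integer "size" and "error" fields in $_FILES are skipped.
function allValidUtf8(array $input): bool
{
    foreach ($input as $key => $value) {
        if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
            return false;
        }
        if (is_array($value)) {
            if (!allValidUtf8($value)) {
                return false;
            }
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Reject the whole request if any input array carries an invalid sequence.
foreach ([$_GET, $_POST, $_COOKIE, $_FILES] as $source) {
    if (!allValidUtf8($source)) {
        http_response_code(400);
        exit('Invalid UTF-8 in request');
    }
}
```

The check recurses because PHP turns parameters such as a[b][]=x into nested arrays, so a flat loop would miss values below the top level.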
QuickOldCar Posted January 11, 2015

It's sad that in this day and age this is still such an issue; it should have been one of the first things sorted out on the web. To me, all sites should serve utf-* or not work at all, but I'm just a peon in the web world. There are absolutely too many types and variations of encoding, period!

A while ago it was announced that php6 would resolve these issues, but it never happened.

Kicken answered the questions.

"If any value is an invalid UTF8 sequence you could just reject the request."

I try to detect whether the input is utf8 and, if it isn't, do the conversion using iconv.
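The "detect, then convert with iconv" approach could look roughly like the sketch below. The helper name toUtf8() is hypothetical, and the fallback source encoding (ISO-8859-1) is purely an assumption: bytes do not carry their own encoding, so anything that is not valid UTF-8 can only be converted on a guess.

```php
<?php
// Hypothetical helper: pass valid UTF-8 through untouched; otherwise convert
// from an ASSUMED legacy encoding. The default of ISO-8859-1 is a guess and
// should be replaced with whatever encoding your clients actually send.
function toUtf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    $converted = iconv($assumedSource, 'UTF-8', $value);
    if ($converted === false) {
        throw new InvalidArgumentException('Could not convert input to UTF-8');
    }
    return $converted;
}
```

For example, toUtf8("caf\xE9") turns the single Latin-1 byte for "é" into its two-byte UTF-8 form, while a string that is already valid UTF-8 is returned unchanged. If the guess about the source encoding is wrong, the result is still valid UTF-8 but shows the wrong characters, which is one argument for rejecting such input instead.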
Jacques1 Posted January 11, 2015

Checking if the input is valid UTF-8 is neither necessary nor particularly useful. Personally, I've never done it.

What are you trying to achieve with this? It does not increase security, because an invalid string is simply an invalid string. The worst that could happen is that the characters aren't displayed correctly. So what?

It also doesn't increase usability, because the browser already takes care of the encoding. An error is very unlikely. It's really only possible if the client uses some homegrown bot which somehow doesn't understand encodings.
NotionCommotion Posted January 11, 2015 (edited)

"Yes. You can set the encoding for the entire table or on a per-column basis."

"You should set the character set in the Content-type header. This applies for both your HTML pages and the JSON data."

Thanks Kicken,

Is there any reason I wouldn't want to specify utf-8 encoding on the whole table? Or better yet, the whole database schema? Assuming a table just has int, tinyint, varchar, char, text, and datetime columns, is there any problem? What if the table includes something like a BINARY column? If so, could I set the encoding on the whole database or whole table, and only go to a different encoding for specific columns?

And, yes, I do use header('Content-type: text/html'); and header('Content-type: application/json; charset'); but never specified the encoding. I take it that it is utf-8 by default if not specified? Is it good practice to explicitly call out the encoding even if it is the default?

I also include a utf-8 meta tag in the HTML. Probably redundant. Any reason not to do it?
NotionCommotion Posted January 11, 2015

"Checking if the input is valid UTF-8 is neither necessary nor particularly useful. Personally, I've never done it."

Good! For once, just what I wanted to hear.
Jacques1 Posted January 11, 2015

As to the encoding declaration: You should always declare the encoding in the Content-Type header and use a meta element. This is explicitly recommended by the W3C.

While the two declarations may sound redundant, they're not: The HTTP header is for the client which directly receives the server response. But then the document may be stored, in which case the HTTP headers are of course lost. Now the meta element takes over.
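As a rough illustration of declaring the encoding in both places, the sketch below sends the Content-Type header and also emits a meta element in the page itself. The function name renderPage() and the page content are made up for the example.

```php
<?php
// Hypothetical helper: builds a minimal HTML5 page that carries its own
// encoding declaration, so the declaration survives when the file is saved
// and the HTTP headers are gone. $body is inserted as-is and is assumed to
// be trusted, already-escaped markup.
function renderPage(string $title, string $body): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta charset=\"utf-8\">\n"
        . "<title>{$safeTitle}</title>\n"
        . "</head>\n<body>{$body}</body>\n</html>\n";
}

// HTTP-level declaration, for the client that receives the response directly.
// The guard avoids a warning if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo renderPage('Encoding demo', '<p>Grüße</p>');
```

The header must be sent before any output, which is why it comes before the echo.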