
Is mb_check_encoding really required for all inputs?


NotionCommotion


I recently read another post which stated:

 

Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.

 

 

Seems like a lot of work. How important is it really to do so? Can the DB be configured in some kind of strict mode that raises an error on anything that isn't valid UTF-8, so I can deal with it as an exception?


Thanks QuickOldCar,

 

I am just getting more into encoding, know nothing about iconv, and am a bit overwhelmed.  A couple of basics...

 

  • Client sends POST or GET request to server.  Should I check each element in the array to ensure it is utf-8?  How would this be done using iconv?
  • My php.ini file states that "PHP's default character set is set to empty", and that refers to utf-8.  But I thought that PHP was basically utf-8 unaware.  What is the point?
  • When creating a PDO connection, I include charset=utf8.  Do I also need to configure the MySQL database as utf8?  How is this done?
  • When sending HTML to the browser, I include the <meta charset="utf-8" /> so the client interprets it correctly.  Should I be doing something similar for JSON?
     

  • Client sends POST or GET request to server.  Should I check each element in the array to ensure it is utf-8?  How would this be done using iconv?
That would be one quick and easy thing you could do, though not necessarily the most flexible. Similar to the old register_globals hack, just loop over the input arrays ($_GET, $_POST, and also $_COOKIE and $_FILES) and check each value. If any value is an invalid UTF-8 sequence, you could just reject the request.
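
As a rough illustration of the loop described above, here is a minimal sketch using mb_check_encoding() (it assumes the mbstring extension is loaded). The helper name allValidUtf8() and the choice of a 400 response are made up for this example, not something from the original post.

```php
<?php
// Hypothetical helper: returns true only if every string in a (possibly nested)
// input array, including string keys, is well-formed UTF-8.
function allValidUtf8(array $input): bool
{
    foreach ($input as $key => $value) {
        if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
            return false;
        }
        if (is_array($value)) {
            // Inputs such as name[]=... and $_FILES entries arrive as nested arrays.
            if (!allValidUtf8($value)) {
                return false;
            }
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Sketch of use at the top of a request handler: reject the whole request
// if any input array contains an invalid byte sequence.
foreach ([$_GET, $_POST, $_COOKIE, $_FILES] as $source) {
    if (!allValidUtf8($source)) {
        http_response_code(400);
        exit('Invalid UTF-8 in request');
    }
}
```

Rejecting outright is the simplest policy; whether that is appropriate depends on who your clients are.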

 

 

  • My php.ini file states that "PHP's default character set is set to empty", and that refers to utf-8.  But I thought that PHP was basically utf-8 unaware.  What is the point?

 

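The default_charset setting is used as the default encoding by a few functions (such as htmlentities()) and for the charset in the Content-Type header that PHP sends by default. Overall, PHP is basically character-set unaware: its plain string functions operate on raw bytes.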
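
To make the "character set unaware" point concrete, here is a small sketch comparing the byte-oriented strlen() with mb_strlen(), which counts characters once it is told the encoding. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// "é" is written as its two UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$word = "caf\xC3\xA9";

$byteCount = strlen($word);              // counts bytes: 5
$charCount = mb_strlen($word, 'UTF-8');  // counts characters: 4

// default_charset is what functions such as htmlentities() fall back to
// when no encoding argument is passed.
$defaultCharset = ini_get('default_charset');

printf("bytes=%d chars=%d default_charset=%s\n", $byteCount, $charCount, $defaultCharset);
```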

 

 

  • When creating a PDO connection, I include charset=utf8.  Do I also need to configure the MySQL database as utf8?  How is this done?

 

Yes. You can set the encoding for the entire table or on a per-column basis, e.g.:

create table blah (
  name varchar(255)
) DEFAULT CHARACTER SET=utf8;

or

create table blah (
  name varchar(255) CHARACTER SET utf8
);
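
For the PHP side of this, here is a hedged sketch of matching the connection charset to the table and converting an existing table. Note that MySQL's utf8 is a 3-byte encoding that cannot store 4-byte characters such as emoji, so the sketch assumes utf8mb4 instead. The helper names, the database name example_db, and the collation are placeholders chosen for the example; blah is the table from the statements above.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that requests utf8mb4 on the connection.
function mysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Hypothetical helper: opens the connection and converts the example table.
// Host and database name are placeholders.
function connectAndConvert(string $user, string $password): PDO
{
    $pdo = new PDO(mysqlDsn('localhost', 'example_db'), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
    // Changes the table default and converts its existing character columns.
    $pdo->exec('ALTER TABLE blah CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci');
    return $pdo;
}
```

The database-wide default can be changed in a similar way with ALTER DATABASE ... CHARACTER SET, which only affects tables created afterwards.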

 

  • When sending HTML to the browser, I include the <meta charset="utf-8" /> so the client interprets it correctly.  Should I be doing something similar for JSON?

     

 

You should set the character set in the Content-type header. This applies for both your HTML pages and the JSON data.

 

header('Content-type: text/html; charset=utf-8');
or 
header('Content-type: application/json; charset=utf-8');
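
A minimal sketch of how the JSON case might look in practice. The function names toJson() and sendJson() are invented for this example. One relevant detail: json_encode() only accepts UTF-8 strings and returns false when given invalid bytes, so it doubles as a late check on your data.

```php
<?php
// Hypothetical helper: returns the JSON text, or null if encoding failed
// (for example because a string was not valid UTF-8).
function toJson($data): ?string
{
    $json = json_encode($data);
    return $json === false ? null : $json;
}

// Hypothetical helper: sends the response with an explicit charset.
function sendJson($data): void
{
    $json = toJson($data);
    if ($json === null) {
        http_response_code(500);
        return;
    }
    header('Content-type: application/json; charset=utf-8');
    echo $json;
}
```

When toJson() returns null, json_last_error() tells you why; for bad input bytes it reports JSON_ERROR_UTF8.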

It's sad that in this day and age this is such an issue; it should have been one of the first things sorted out on the web.

To me, all sites should serve utf-* or not even work, but I'm just a peon in the web world.

There are absolutely too many types and variations of encoding, period!!

 

A while ago it was announced that PHP 6 would resolve these issues, but it never happened.

 

Kicken answered the questions.

If any value is an invalid UTF8 sequence you could just reject the request.

 

I try to detect whether it is UTF-8 and, if it isn't, do the conversion using iconv.
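
A rough sketch of that detect-then-convert approach, assuming the mbstring and iconv extensions. The big caveat is that you cannot reliably tell which encoding invalid input was really in, so the source encoding here (ISO-8859-1) is purely an assumption, and toUtf8() is a made-up name.

```php
<?php
// Hypothetical helper: returns $value unchanged when it is already valid UTF-8,
// otherwise converts it from an ASSUMED source encoding.
function toUtf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    $converted = iconv($assumedSource, 'UTF-8', $value);
    if ($converted === false) {
        throw new InvalidArgumentException('Input could not be converted to UTF-8');
    }
    return $converted;
}
```

If the guess about the source encoding is wrong, the result is valid UTF-8 but the wrong characters, which is the trade-off compared with simply rejecting the request.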


Checking if the input is valid UTF-8 is neither necessary nor particularly useful. Personally, I've never done it.

 

What are you trying to achieve with this? It does not increase security, because an invalid string is simply an invalid string. The worst that could happen is that the characters aren't displayed correctly. So what?

 

It also doesn't increase usability, because the browser already takes care of the encoding. An error is very unlikely. It's really only possible if the client uses some homegrown bot which somehow doesn't understand encodings.


Yes. You can set the encoding for the entire table or on a per-column basis, e.g.:

You should set the character set in the Content-type header. This applies for both your HTML pages and the JSON data.

 

 

Thanks Kicken,

 

Any reason I wouldn't want to specify utf-8 encoding on the whole table? Or better yet, the whole database schema? Assuming a table just has int, tinyint, varchar, char, text, and datetime columns, is there any problem? What about if the table includes something like a BINARY column? If so, could I set the whole database or whole table to utf-8, and only go to a different encoding for specific columns?

 

And, yes, I do use header('Content-type: text/html'); and header('Content-type: application/json'); but never specified the encoding. I take it it is utf-8 by default if not specified? Is it good practice to explicitly call out the encoding even if it is the default? I also include a utf-8 meta tag in the HTML. Probably redundant. Any reason not to do it?


As to the encoding declaration:

 

You should always declare the encoding in the Content-Type header and use a meta element. This is explicitly recommended by the W3C.

 

While the two declarations may sound redundant, they're not: the HTTP header is for the client that directly receives the server response. But the document may then be saved, in which case the HTTP headers are of course lost. Now the meta element takes over.

