Jump to content

Recommended Posts

I know a weird request, but please help. Hi I am trying to help myself but my workcenter is making that difficult. There is a link in this forum post ( http://www.phpfreaks.com/forums/index.php/topic,229038.0.html ) that I want to see but my workcenter has blocked the site because it is uncatagorized. Can some one copy and past the blog back to the forum for me?

Thank you!

 

 

    * Log in

    * Subscribe RSS Feed

 

code.openark.org

 

    *

      Blog by Shlomi Noach

    * Home

    * About

    * Shlomi Noach

    * יעוץ

 

 

 

 

    *

      Search

          o

    *

      About me

          o Name: Shlomi Noach

            Location: Israel

 

            I am a DBA, consultant and an authorized MySQL instructor. A developer (mainly C/C++, Java, Web) and code architect.

            I provide MySQL consulting services in Israel and world wide, and am available for hire.

    *

      Recent Posts

          o LOCK TABLES in MyISAM is NOT a poor man’s tranactions substitute

          o Tales of the Trade #1: A day in the life of a MySQL instructor

          o MySQL User Group Meetings in Israel

          o Blocking user accounts

          o Announcing openark kit

    *

      Archives

          o March 2009

          o February 2009

          o January 2009

          o December 2008

          o November 2008

          o October 2008

          o June 2008

    *

      Blogroll

          o openark forge

          o Planet MySQL

          o יעוץ MySQL

    *

      Categories

          o MySQL (RSS) (29)

    *

      Meta

          o RSS 2.0 Feed

          o Atom Feed

          o Comments RSS Feed

          o Log in

          o Valid XHTML

    *

      Tag Cloud

          o Analysis Backup Configuration Data Types Event Scheduler Execution plan Humor Indexing INFORMATION_SCHEMA InnoDB Installation Linux memcached MyISAM mysqldump Normalization openark kit python Query Cache Replication Schema Security SQL sql_mode Syntax Triggers User Group

 

Dec

08

2008

MySQL’s character sets and collations demystified

 

MySQL’s character sets and collations are often considered as a mystery, and many users either completely disregard them and keep with the defaults, or set everything to UTF8.

 

This post will attempt to shed some light on the mystery, and provide with some best practices for use with text columns with regard to character sets.

Character Sets

 

A thorough discussion of how the character sets have evolved through history is beyond the scope of this post. While the Unicode standard is gaining recognition, the “older” character sets are still around. Understanding the difference between Unicode and local character sets is crucial.

 

Consider, for example, MySQL’s latin1 character set. In this character set there are 256 different characters, represented by one byte. The first 128 characters map to ASCII, the standard “ABCabc012 dot comma” set, of which most of this post is composed. The latter 128 characters in latin1 are composed of West European specific characters, such as À, ë, õ, Ñ.

 

A Name VARCHAR(60) CHARSET latin1 column can describe names with West European characters. But it cannot describe Russian or Hebrew names. To represent a name in Hebrew, you’d need the hebrew charset (ISO 8859-8), in which the first 128 characters are, as always, mapped to ASCII, and the latter 128 characters describe the Hebrew alphabet and punctuation marks, such as ש,ל,מ,ה. The Cyrillic, Arabic and Turkish charsets follow in a similar manner.

 

Assume now I’m building a world wide web application, such as a popular social network. I would like to store the first names of my users, in every possible language. None of the above character sets support all languages. I therefore turn to Unicode. In particular, MySQL supports utf8, a Unicode encoding scheme, which is commonly used due to its economic storage requirements.

 

In Unicode there is a dedicated number for each letter in the known languages, in ancient languages, and some imaginary or otherwise non existing languages, such as Klingon (yes, I know there are people who actually speak Klingon), may yet find their way into the standard.

 

UTF8 (or utf8), a Unicode encoding scheme, states the following: for ASCII characters, such as ‘a’, ‘6′, ‘$’, only one byte of storage is required. For Hebrew, Cyrillic or Turkish characters, 2 bytes are required. For Japanese, Chinese - more (MySQL supports up to 3 bytes per character). Again, the exact details of the implementation are beyond the scope of this post, and are well described here and here.

 

What’s important to me is that I can define Name VARCHAR(30) CHARSET utf8 for my columns, and Voila! Any name can be represented in my database.

So why not define everything as utf8 and get done with it?

 

Well, it just so happens that Unicode comes with a price. See, for example, the following column definition:

 

    CountryCode CHAR(3) CHARSET utf8

 

We are asking for a column with 3 characters exactly. The required storage for this column will be such that any 3-letter name must fit in. This means (3 characters) times (3 bytes per character) = 9 bytes of storage. So CHAR and utf8 together may be less than ideal. VARCHAR behaves better: it only requires as many bytes per character as described above. So the text “abc” will only require 3 bytes (plus VARCHAR’s leading 1 or 2 bytes).

Why don’t we drop the ‘CHAR’ altogether, then, and use only ‘VARCHAR’?

 

Because some values are simply better represented with CHAR: consider a “password” column, encoded with MD5. The MD5() function returns a 32 characters long text. It’s always 32 characters, and, moreover, it’s always in ASCII. The best data type and character set definition would be password CHAR(32) CHARSET ascii. We thus ensure exactly 32 bytes are allocated to this column. A VARCHAR will acquire an additional byte or two, depending on its defined length, which will indicate the length of the text.

And why would I care about collations?

 

Collations deal with text comparison. We observed that the default character set in MySQL is latin1. The default collation is latin1_swedish_ci. In this collation the following holds true: 'ABC' = 'abc'.

 

Wait. What?

 

Look at the “ci” in latin1_swedish_ci. It stands for “case insensitive”. Collations which end with “cs” or “bin” are case sensitive. The utf8 character set comes with utf8_general_ci collation. This can make sense. Let’s review our web application table (I’m using plain text passwords here, bare with me for this example):

 

    CREATE TABLE my_users (

      name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci,

      plainPassword VARCHAR(16) CHARSET ASCII,

      UNIQUE KEY (name)

    );

    INSERT INTO my_users (name, password) VALUES ('David', 'mypass');

 

It holds true that the name ‘David’ equals ‘david’. If I were to SELECT * FROM my_users WHERE name='david', I would find the desired row. The unique key will also guarantee that no daVID user can be added.

 

But David certainly wouldn’t want users to login with the password ‘MYPASS’. So we refine our table:

 

    CREATE TABLE my_users (

      name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci,

      plainPassword VARCHAR(16) CHARSET ascii COLLATE ascii_bin,

      UNIQUE KEY (name)

    );

 

The ascii_bin collation is a case sensitive collation for ascii. The following will not find anything:

 

    SELECT * FROM my_users WHERE name='david' AND plainPassword='MYPASS';

 

Holding a plain text password in your database is not a best practice, but apparently it’s common.

 

Collations also deal with text ordering. For any two strings, the collation determines which is larger, or if they are equal. Probably the most common situation you see collations in action is when you ORDER BY a text column.

Also keep in mind

 

    * When you check for length of strings, do you use the LENGTH() function, as in SELECT LENGTH(Name) FROM City? You probably wish to replace this with CHAR_LENGTH(). LENGTH() returns the number of bytes required for the text storage. CHAR_LENGTH() returns the number of characters in the text, and is usually what you are looking for. It may hold true that for a string s, LENGTH(s)=12 and CHAR_LENGTH(s)=8. Watch out for these glitches.

    * You can converts texts between character sets with CONVERT. For example: CONVERT(s USING utf8)

    * Stored routines should not be overlooked. If your stored routine accepts a text argument, or if your stored function returns one, make sure the character sets are properly defined. If not, then your utf8 text may be converted to latin1 during the call to your stored routine. This also applies to local parameters within the stored routines.

    * An ALTER TABLE <some table> CONVERT TO <some charset> will change the character set not only for the table itself, but also for all existing textual columns.

 

See the following post: Useful database analysis queries with INFORMATION_SCHEMA for queries which diagnose your databases character sets.

 

tags: Data Types, SQL, Syntax

posted in MySQL by shlomi

 

Follow comments via the RSS Feed | Leave a comment | Trackback URL

9 Comments to "MySQL’s character sets and collations demystified"

 

  1.

 

      Jason Stubbs wrote:

 

      It’s also worth noting that the difference between CHAR/VARCHAR space savings only occur on disk as fields are always fixed width in memory. This means that even though setting the md5 password column to be a VARCHAR(32) CHARSET utf8 wouldn’t be much different to CHAR(32) CHARSET ascii on disk, in memory the usage would still be three times.

 

      Link | December 8th, 2008 at 9:59 am

  2.

 

      shlomi wrote:

 

      Jason, thanks, good point.

 

      It even goes beyond that: whenever MySQL creates a temporary table due to non-indexed ORDER BY, or a VIEW query etc., it allocates the full extent for VARCHAR, as if it were a CHAR field. This means that expensive operations which must pay with temporary tables, may become prohibitively more expensive.

 

      Link | December 8th, 2008 at 10:06 am

  3.

 

      The Developer Day » Blog Archive » MySQL character sets and collations wrote:

 

      [...] great article i found about MySQL character sets and collations that demystifies all the so often found problems in projects made by confused or not so [...]

 

      Link | December 8th, 2008 at 2:05 pm

  4.

 

      Chris wrote:

 

      An md5 of the password is not actually ASCII but hexadecimal. You would save twice the space if you saved it as a binary field and used HEX() and UNHEX() to insert and read the data.

 

      Link | December 8th, 2008 at 10:17 pm

  5.

 

      Vladislav Vaintroub wrote:

 

      Nope, there is no Klingon in Unicode.

 

      Link | December 9th, 2008 at 12:55 am

  6.

 

      Vladislav Vaintroub wrote:

 

      And there will probably never be Klingon in Unicode. They submitted their proposal for 4.0 and it got rejected at that time and later for 5.0. The language is too weird and unreal to be included

 

      Link | December 9th, 2008 at 12:58 am

  7.

 

      shlomi wrote:

 

      Vladislav,

      Like I said, it may yet find its way into Unicode… :)

      There’s Tolkien’s elves language, as well, which is competing for Unicode as well, as far as I know.

 

      Regards

 

      Link | December 9th, 2008 at 6:15 am

  8.

 

      shlomi wrote:

 

      Chris,

 

      Thanks, indeed so.

      MySQL’s PASSWORD() function also returns hexadecimal text, but nevertheless the passwords are stored as CHAR(41), not BINARY(20).

      It’s common use to store MD5 as text. I’m not sure myself if it should be so.

      See my earlier post: http://code.openark.org/blog/?p=85

 

      Shlomi

 

      Link | December 9th, 2008 at 6:19 am

  9.

 

      PHP Freaks Forums > SQL / Database Forums > MySQL Help > Topic: MySQL’s character sets and collations demystified wrote:

 

      This article has been ’sticked’ on phpfreaks forum.

 

      Link | December 9th, 2008 at 10:14 pm

 

Leave Your Comment

 

Name (required)

 

Mail (will not be published) (required)

 

Website

 

Notify me of followup comments via e-mail

 

Powered by Wordpress and MySQL. Theme by openark.org

 

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.