How I deal with foreign characters

david85 · August 16, 2012

I'm kinda lost as to how I should deal with foreign characters in PHP/MySQL.

First of all, I'm receiving a plain text file which I'm guessing could arrive in any of a range of encodings. It seems rather difficult to detect the encoding accurately, but there is mb_detect_encoding(). Potentially I could just stick to the 'likely' default encoding used by Notepad, which at the moment seems to be ANSI, or rather extended ANSI.

I'm hoping someone could give me some pointers so I don't spend the next 3 days in a ramble of pages...

Christian F. · August 16, 2012

Just make sure that you use UTF-8 at every single step along the way, and there shouldn't be any problems.

That means:

[*]Send the Content-Type: text/html; charset=utf-8 header to the browser.

[*]Make sure your editor saves all files as UTF-8, without the BOM.

[*]If you're using a database: Define all tables using the UTF-8 charset, and call set_charset() immediately after opening a connection from PHP.

[*]Use MB-aware functions at all times, and (if possible) specify UTF-8 as the charset you're working with.

[*]If you work against external systems/files, over which you have no control, then first detect the charset of said files before converting to UTF-8 (if necessary).
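A minimal sketch of how the steps above might look in PHP. It assumes the mbstring and mysqli extensions are available; the helper name open_utf8_connection() and its parameters are made up for illustration, not something from this thread.

```php
<?php
// Assumption: the mbstring extension is loaded.

// 1. Tell the browser the page is UTF-8. This must run before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// 2. Make UTF-8 the default encoding for the mb_* functions.
mb_internal_encoding('UTF-8');

// 3. Hypothetical helper: open a mysqli connection and switch it to UTF-8 right away.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }
    // 'utf8' is MySQL's name for its UTF-8 charset; 'utf8mb4' also covers 4-byte characters.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }
    return $link;
}

// 4. Byte-oriented functions miscount multibyte text; the mb_* versions do not.
$word  = "na\xC3\xAFve";              // "naïve" written as explicit UTF-8 bytes
$bytes = strlen($word);               // counts bytes
$chars = mb_strlen($word, 'UTF-8');   // counts characters
```

Here $bytes comes out larger than $chars because the "ï" takes two bytes in UTF-8, which is why the plain string functions are unsafe for this kind of text.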

Notepad doesn't use ASCII or extended ASCII, but the default Windows charset, which differs depending on which regional language your Windows install is set to use. These charsets are indeed compatible with basic ASCII, just as UTF-8 is, but any byte value above 127 will cause problems.
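To illustrate why the regional setting matters, here is a small sketch, assuming mbstring is loaded. The two code pages are picked purely as examples (Western European and Cyrillic); the same single byte turns into a different character depending on which one you assume.

```php
<?php
// One byte above the ASCII range, as it might appear in a Notepad "ANSI" file.
$byte = "\xE9";

// Interpreted as Windows-1252 (Western European) it is "é" ...
$western = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// ... but interpreted as Windows-1251 (Cyrillic) it is "й".
$cyrillic = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251');

// Plain ASCII text is identical in all of these charsets, and in UTF-8.
$asciiOk   = mb_check_encoding('plain text', 'ASCII');
$asciiFail = mb_check_encoding($byte, 'ASCII');
```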

btherl · August 17, 2012

Detecting encoding is a tricky business, and mb_detect_encoding() probably won't help. The problem is that EVERY document is valid in the ISO 8859 and Windows code pages; they just end up with different characters. Not all documents are valid UTF-8 though, so you can sometimes rule UTF-8 out. Trying to decode the document as UTF-8 is a good start, because if that fails you know it's not UTF-8, it's something else. Another good start is to check whether there is anything other than plain English characters: if that's all there is, you don't need to do anything.
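Roughly, that approach could be sketched like this. The function name to_utf8() and the Windows-1252 fallback are assumptions for the example; the fallback has to be whatever code page you actually expect, since it cannot be detected reliably.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, guessing as little as possible.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    // Only plain ASCII bytes: identical in every charset discussed here, so leave it alone.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return $bytes;
    }

    // Decodes cleanly as UTF-8: treat it as UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Not UTF-8, so it is "something else". Convert from the assumed code page.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$plain   = to_utf8('hello');           // ASCII, returned unchanged
$already = to_utf8("caf\xC3\xA9");     // valid UTF-8, returned unchanged
$legacy  = to_utf8("caf\xE9");         // Windows-1252 bytes, converted to UTF-8
```

Note that the last step is still a guess: a file saved in a different code page will convert without error but with the wrong characters.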

If you can, do what ChristianF is suggesting and make sure you know the encoding and don't need to detect it.

Also, if you know what languages you are dealing with, it helps. E.g. if you are just dealing with Chinese or Japanese, there are specific encodings they usually use and you can distinguish them more easily. And many languages have a standard code page that everyone uses.
