How I deal with foreign characters


david85

I'm kind of lost as to how I should deal with foreign characters in PHP/MySQL.

First of all, I'm receiving a plain text file, which I'm guessing could arrive in any of a range of encodings. It seems rather difficult to detect the encoding accurately, but there is mb_detect_encoding(). Alternatively, I could just assume the 'likely' default encoding used by Notepad, which at the moment seems to be ANSI, or rather extended ANSI.

I'm hoping someone can give me some pointers so that I don't spend the next 3 days wading through pages...


Just make sure that you use UTF-8 at every single step along the way, and there shouldn't be any problems.

That means:

  1. Send the header Content-Type: text/html; charset=utf-8 to the browser.

  2. Make sure your editor saves all files as UTF-8, without the BOM.

  3. If you're using a database: define all tables using a UTF-8 charset (in MySQL, utf8mb4 covers all of Unicode, while the older utf8 stores at most three bytes per character), and call set_charset() immediately after opening a connection from PHP.

  4. Use MB-aware functions at all times, and (if possible) specify UTF-8 as the charset you're working with.

  5. If you work against external systems/files over which you have no control, first detect the charset of said files, then convert them to UTF-8 if necessary.
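In PHP, the checklist above might look roughly like this. It's only a sketch: the helper name openUtf8Connection and its parameters are made up for illustration, mysqli is assumed as the database extension, and utf8mb4 is assumed as the MySQL charset.

```php
<?php
declare(strict_types=1);

// Step 1: tell the browser the page is UTF-8. Must run before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Step 4: make UTF-8 the default charset for the mb_* functions.
mb_internal_encoding('UTF-8');

// Step 3 (hypothetical helper): open a mysqli connection and switch it
// to UTF-8 right away, before running any query.
function openUtf8Connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = new mysqli($host, $user, $pass, $db);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }
    if (!$conn->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }
    return $conn;
}

// Why step 4 matters: strlen() counts bytes, mb_strlen() counts characters.
$word  = "na\u{00EF}ve";   // "naive" with a diaeresis on the i, written as UTF-8
$bytes = strlen($word);    // 6
$chars = mb_strlen($word); // 5
```

The helper is never called here, so the snippet runs without a database; in a real script you would call it once and reuse the connection.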

 

Notepad doesn't use ASCII or extended ASCII, but the default Windows charset (what Notepad calls "ANSI"), which differs depending on the regional language your Windows install is set to use. These code pages are indeed compatible with basic ASCII, just as UTF-8 is, but any byte value above 127 will cause problems, because it means something different in each encoding.
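To make that concrete, here is a small sketch of how one and the same byte above 127 turns into different characters depending on which Windows code page you assume. The two code pages are just examples picked for illustration (Western European and Cyrillic).

```php
<?php
declare(strict_types=1);

// One byte above 127, as a legacy Windows editor might save it.
$byte = "\xE9";

// The same byte read as Western European vs. Cyrillic text.
$western  = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252'); // U+00E9, e with acute
$cyrillic = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251'); // U+0439, Cyrillic short i

// On its own, the byte is not valid UTF-8 at all.
$isUtf8 = mb_check_encoding($byte, 'UTF-8');

echo bin2hex($western), ' ', bin2hex($cyrillic), ' ', var_export($isUtf8, true), "\n";
// prints: c3a9 d0b9 false
```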


Detecting encoding is tricky business, and mb_detect_encoding() probably won't help. The problem is that almost EVERY document is valid in the ISO 8859 and Windows code pages; the bytes just end up as different characters. Not all documents are valid UTF-8 though, so you can sometimes rule UTF-8 out. Trying to decode the document as UTF-8 is a good start, because if that fails you know it's something else. Another good first check is whether the document contains nothing but plain English (ASCII) characters; if that's all there is, you don't need to do anything.

If you can, do what ChristianF is suggesting and make sure you know the encoding and don't need to detect it.

Also, if you know what languages you are dealing with, it helps. E.g. if you are just dealing with Chinese or Japanese, there are specific encodings they usually use and you can distinguish them more easily. And many languages have a standard code page that everyone uses.
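A minimal sketch of that approach: check for plain ASCII first, then try UTF-8, and only then fall back to a legacy code page. The function name toUtf8 is hypothetical, and the Windows-1252 default is purely an assumption; the fallback has to come from what you know about your senders.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: return $bytes as UTF-8.
// $fallback is a guess about the legacy encoding and must match your data.
function toUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    // Only bytes 0-127: plain ASCII, which is already valid UTF-8.
    if (preg_match('/\A[\x00-\x7F]*\z/', $bytes) === 1) {
        return $bytes;
    }

    // Decodes cleanly as UTF-8: keep it as it is.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // UTF-8 is ruled out, so convert from the assumed legacy code page.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}
```

For example, toUtf8("caf\xE9") returns the UTF-8 bytes "caf\xC3\xA9", while input that is already ASCII or UTF-8 comes back unchanged. Note that a legacy-encoded file can occasionally happen to be valid UTF-8 by coincidence, so this is a heuristic and not a guarantee.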
