david85 Posted August 16, 2012

I'm kind of lost as to how I should deal with foreign characters in PHP/MySQL. First of all, I'm receiving a plain text file, which I'm guessing could arrive in any of a range of encodings. It seems rather difficult to detect the encoding accurately, but there is mb_detect_encoding(). Alternatively, I could just stick to the 'likely' default encoding that Notepad saves in, which at the moment seems to be ANSI, or rather extended ANSI. I'm hoping someone can give me some pointers so that I don't spend the next 3 days rambling through pages...
Christian F. Posted August 16, 2012

Just make sure that you use UTF-8 at every single step along the way, and there shouldn't be any problems. That means:
[*]Send the header Content-Type: text/html; charset=utf-8 to the browser.
[*]Make sure your editor saves all files as UTF-8, without the BOM.
[*]If you're using a database: define all tables using the UTF-8 charset, and call set_charset() immediately after opening a connection from PHP.
[*]Use MB-aware functions at all times, and (if possible) specify UTF-8 as the charset you're working with.
[*]If you work against external systems/files over which you have no control, first detect the charset of those files, then convert them to UTF-8 (if necessary).

Notepad doesn't use ASCII or extended ASCII, but the default Windows charset, which differs depending on which regional language your Windows install is set to use. Those charsets are indeed compatible with basic ASCII, just as UTF-8 is, but any byte with a value above 127 will cause problems if it is treated as UTF-8 without being converted.
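As a rough illustration of the checklist above, here is a minimal PHP sketch. It assumes the mbstring and mysqli extensions are available; the helper names send_utf8_header() and open_utf8_connection() and the connection parameters are hypothetical placeholders, not anything from the original posts.

```php
<?php
// Sketch only: helper names and connection details are illustrative assumptions.

// Tell the browser the page is UTF-8. Must run before any output is sent.
function send_utf8_header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Open a MySQL connection and switch it to UTF-8 right away.
// MySQL's 'utf8' charset stores at most 3 bytes per character;
// 'utf8mb4' (MySQL 5.5.3 and later) is an alternative that covers all of Unicode.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    $link->set_charset('utf8');
    return $link;
}

send_utf8_header();

// Make UTF-8 the default for the mb_* functions, and still pass it explicitly.
mb_internal_encoding('UTF-8');

// "naïve" written with explicit UTF-8 bytes so the example does not depend
// on how this file itself was saved.
$word = "na\xC3\xAFve";

$bytes = strlen($word);                  // counts bytes: 6
$chars = mb_strlen($word, 'UTF-8');      // counts characters: 5
$upper = mb_strtoupper($word, 'UTF-8');  // "NAÏVE", i.e. "NA\xC3\x8FVE"
```

The byte-oriented strlen() and the character-oriented mb_strlen() disagree as soon as a non-ASCII character appears, which is why the MB-aware functions matter.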
btherl Posted August 17, 2012

Detecting encoding is a tricky business, and mb_detect_encoding() probably won't help. The problem is that practically EVERY document is valid in the ISO 8859 and Windows code pages; the bytes just decode to different characters. Not all documents are valid UTF-8, though, so you can sometimes rule UTF-8 out. Trying to decode the document as UTF-8 is a good start, because if that fails you know it's not UTF-8 but something else. Another good start is to check whether the file contains nothing other than plain English characters; if that's all there is, you don't need to do anything.

If you can, do what ChristianF is suggesting and make sure you know the encoding, so that you don't need to detect it. It also helps if you know which languages you are dealing with. E.g. if you are just dealing with Chinese or Japanese, there are specific encodings they usually use, and you can distinguish them more easily. And many languages have a standard code page that everyone uses.
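The approach described above (leave plain ASCII alone, try UTF-8 first, otherwise fall back to a known code page) could be sketched roughly like this in PHP. The function name to_utf8() is hypothetical, and the Windows-1252 fallback is an assumption about where the files come from; swap in whichever code page your senders actually use.

```php
<?php
// Sketch only: to_utf8() and the Windows-1252 default are illustrative assumptions.
function to_utf8(string $raw, string $fallback = 'Windows-1252'): string
{
    // Only bytes 0-127 present: plain ASCII, which is already valid UTF-8.
    if (!preg_match('/[\x80-\xFF]/', $raw)) {
        return $raw;
    }

    // The bytes form valid UTF-8, so assume the file really is UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Not valid UTF-8: treat it as the assumed single-byte code page and convert.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$ascii   = to_utf8("plain text");   // unchanged
$already = to_utf8("caf\xC3\xA9");  // valid UTF-8 "café", unchanged
$legacy  = to_utf8("caf\xE9");      // Windows-1252 "café", becomes "caf\xC3\xA9"
```

Note that the fallback step cannot tell one single-byte code page from another, so the result is only as good as the assumed fallback encoding.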