ocpaul20 Posted August 14, 2008

A puzzle for the day. I need to extract each Chinese character individually from a file. In this example, I have 10 Chinese characters in a file (CEDICT), and they are represented by 31 bytes when I hex-dump the file:

0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90

Usually these Chinese characters are 3 bytes each, although some are 4. How do I tell which ones are 4-byte characters and which are 3-byte characters, please? I guess it has something to do with UTF-8 encoding, but I don't know how to start working this out.

Thanks for any help on this.

Paul

Link to comment: https://forums.phpfreaks.com/topic/119612-3-or-4-byte-chinese-characters/
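For reference, UTF-8 stores the length of each character in the top bits of its first byte: 0xxxxxxx starts a 1-byte character, 110xxxxx a 2-byte one, 1110xxxx a 3-byte one, and 11110xxx a 4-byte one. Here is a minimal sketch of reading that, assuming the input is well-formed UTF-8 and PHP 7 or later (for the type hints). The function names `utf8_sequence_length()` and `utf8_split()` are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the given lead byte (0-255). Returns 0 for a continuation byte
// (10xxxxxx) or for 0xF8-0xFF. It checks only the leading bit pattern and
// does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.
function utf8_sequence_length(int $leadByte): int
{
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx: ASCII
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;                                  // not a lead byte
}

// Hypothetical helper: split a byte string into UTF-8 characters by
// walking the lead bytes. Stops at the first byte that is not a lead byte.
function utf8_split(string $bytes): array
{
    $chars = [];
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        $len = utf8_sequence_length(ord($bytes[$i]));
        if ($len === 0) {
            break; // input is not valid UTF-8 at this offset
        }
        $chars[] = substr($bytes, $i, $len);
        $i += $len;
    }
    return $chars;
}

// The 31 bytes from the question.
$sample = "\xe9\x98\xbf\xe4\xb8\x8d\xe6\x9d\xa5\xe6\x8f\x90\x2e"
        . "\xe9\x98\xbf\xe4\xb8\x8d\xe9\x83\xbd\xe7\x83\xad\xe8\xa5\xbf\xe6\x8f\x90";

foreach (utf8_split($sample) as $char) {
    echo $char, ' => ', strlen($char), " byte(s)\n";
}
```

Run against the 31 bytes from the question, this should list eleven items: ten 3-byte characters plus the 1-byte full stop (0x2e). A 4-byte character would start with a lead byte in the range 0xF0 to 0xF4, which is how UTF-8 encodes code points above U+FFFF.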
toplay Posted August 14, 2008

Why not just read the whole line and then use substr() or any of the other string functions?

http://us3.php.net/manual/en/function.substr.php
http://us3.php.net/manual/en/function.utf8-decode.php
ocpaul20 (Author) Posted August 14, 2008

OK, I'll give that a whirl. Thanks.
ocpaul20 (Author) Posted August 14, 2008

I need to use the mb_ extension (mbstring) functions, such as mb_strlen(), to get the correct length, etc.
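To illustrate the difference: strlen() and substr() count bytes, while the mb_ functions count characters in the encoding they are given. A small sketch, assuming the mbstring extension is loaded and the data is UTF-8; it uses only the first 13 bytes of the sample (four characters and the full stop):

```php
<?php
// Assumes the mbstring extension is available and the input is UTF-8.
// First 13 bytes of the sample: four 3-byte characters and a full stop.
$text = "\xe9\x98\xbf\xe4\xb8\x8d\xe6\x9d\xa5\xe6\x8f\x90\x2e";

echo strlen($text), "\n";             // byte count: 13
echo mb_strlen($text, 'UTF-8'), "\n"; // character count: 5

// Extract one character at a time with mb_substr().
$chars = [];
$count = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $count; $i++) {
    $chars[] = mb_substr($text, $i, 1, 'UTF-8');
}
print_r($chars);
```

On PHP 7.4 and later, mb_str_split($text, 1, 'UTF-8') should give the same array in one call. The mb_substr() loop is kept here because it also works on older versions.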
effigy Posted August 14, 2008

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<pre>
<?php
$data = '0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90';

### Separate the hex codes.
$codes = preg_split('/\p{Z}+/', $data);

### Convert to characters.
$result = '';
foreach ($codes as $code) {
    // hexdec() is used because '0xe9' + 0 only yields 233 on PHP versions
    // before 7, where hex strings were still treated as numeric.
    $result .= pack('C', hexdec($code));
}
echo $result, '<hr>';

### Separate into characters. The /u modifier makes PCRE read $result as UTF-8.
### The split happens before each letter, so the period stays with the character before it.
print_r(preg_split('/(?=\p{L})/u', $result, -1, PREG_SPLIT_NO_EMPTY));
?>
</pre>
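Because the split above happens only in front of letters, punctuation such as the full stop stays attached to the character before it. One possible variation, assuming the string is valid UTF-8, is to split on an empty pattern with the /u modifier so that every character, including punctuation, comes out on its own. The three-character `$result` below is just a short excerpt of the sample for illustration.

```php
<?php
// Assumes $result holds valid UTF-8. Here: one character, the full stop,
// and the character after it, taken from the sample bytes.
$result = "\xe6\x8f\x90\x2e\xe9\x98\xbf";

// With /u, PCRE treats the subject as UTF-8, so the empty pattern matches
// between characters rather than between bytes.
$chars = preg_split('//u', $result, -1, PREG_SPLIT_NO_EMPTY);

// strlen() on each piece gives that character's size in bytes.
foreach ($chars as $char) {
    printf("%s is %d byte(s)\n", $char, strlen($char));
}
```

This should report 3 bytes, 1 byte, and 3 bytes for the excerpt. A 4-byte character would simply show up as a piece with strlen() equal to 4.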