3- or 4-byte Chinese characters?

ocpaul20 · August 14, 2008

A puzzle for the day.

I need to extract each Chinese character individually from a file.

In this example, I have 10 Chinese characters in a file (CEDICT), and they are represented by 31 bytes when I hex-dump the file.

0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90

Usually these Chinese characters are 3 bytes each, although some are 4.

How do I tell which ones are 4-byters and which are 3-byters, please?

I guess it has something to do with UTF-8 encoding, but I would not know where to start to determine this.

Thanks for any help on this.

Paul

toplay · August 14, 2008

Why not just read the whole line and then use substr() or any of the string functions?

http://us3.php.net/manual/en/function.substr.php

http://us3.php.net/manual/en/function.utf8-decode.php

ocpaul20 · August 14, 2008

OK, I'll give that a whirl. Thanks.

ocpaul20 · August 14, 2008

I need to use the mb_ functions from the mbstring extension:

mb_strlen() to get the correct length, etc.
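
For anyone following along, here is a minimal sketch of that mb_ approach. It assumes the mbstring extension is loaded and that the input is valid UTF-8. The variable names are illustrative only, and the sample string is just the first four characters and the full stop from the hex dump in the question.

```php
<?php
// Assumption: mbstring is available and $line holds valid UTF-8.
// Sample bytes: the first four characters plus the "." (0x2e) from the dump.
$line = "\xe9\x98\xbf\xe4\xb8\x8d\xe6\x9d\xa5\xe6\x8f\x90\x2e";

$chars = [];
$count = mb_strlen($line, 'UTF-8');              // counts characters, not bytes
for ($i = 0; $i < $count; $i++) {
    $chars[] = mb_substr($line, $i, 1, 'UTF-8'); // one character at a time
}

foreach ($chars as $ch) {
    // strlen() counts bytes, so it reports the encoded width of each character.
    echo bin2hex($ch), ' => ', strlen($ch), " byte(s)\n";
}
```

Passing the encoding explicitly avoids depending on the configured internal encoding, which may not be UTF-8 on every installation.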

effigy · August 14, 2008

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<pre>
<?php
$data = '0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90';
### Separate.
$codes = preg_split('/\p{Z}+/', $data);
### Convert to characters.
$result = '';
foreach ($codes as $code) {
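	// Parse "0xe9" explicitly as hex; since PHP 7, hex strings are no longer numeric, so '0xe9' + 0 yields 0.
	$result .= pack('C', intval($code, 16));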
}
echo $result, '<hr>';
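### Separate into characters (the u modifier makes PCRE treat $result as UTF-8; the "." stays attached to the preceding letter).
print_r(preg_split('/(?=\p{L})/u', $result, -1, PREG_SPLIT_NO_EMPTY));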
?> 
</pre>
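
On the original question of telling 3-byte from 4-byte characters: in UTF-8, the first byte of each sequence encodes its length (0xxxxxxx is 1 byte, 110xxxxx is 2, 1110xxxx is 3, 11110xxx is 4, and 10xxxxxx is a continuation byte). The following is only a rough sketch of that rule, not anyone's posted solution. The helper names `utf8SequenceLength` and `splitUtf8` are made up for illustration, and the code does not verify that continuation bytes are well-formed.

```php
<?php
// Hypothetical helper: length in bytes of the UTF-8 sequence starting with $byte.
function utf8SequenceLength($byte)
{
    if ($byte < 0x80) { return 1; }            // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;                                  // continuation byte or invalid lead byte
}

// Hypothetical helper: split a UTF-8 string into characters by walking lead bytes.
function splitUtf8($s)
{
    $chars = [];
    $i = 0;
    $n = strlen($s);
    while ($i < $n) {
        $len = utf8SequenceLength(ord($s[$i]));
        if ($len === 0 || $i + $len > $n) {
            throw new InvalidArgumentException("Invalid UTF-8 at byte offset $i");
        }
        $chars[] = substr($s, $i, $len);
        $i += $len;
    }
    return $chars;
}

// Sample: the first two characters from the dump, followed by the "." (0x2e).
$sample = "\xe9\x98\xbf\xe4\xb8\x8d\x2e";
foreach (splitUtf8($sample) as $ch) {
    echo bin2hex($ch), ' => ', strlen($ch), " byte(s)\n";
}
```

Applied to the dump in the question, every Chinese character starts with a byte in the 0xe0 to 0xef range, so all ten are 3 bytes long. The 31st byte is the single-byte 0x2e full stop, not part of a 4-byte character.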
