
3- or 4-byte Chinese characters?


ocpaul20


A puzzle for the day.

 

I need to extract each Chinese character individually from a file.

 

In this example, I have 10 Chinese characters from a file (CEDICT), and a hex dump of them shows 31 bytes:

0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90

 

Usually these Chinese characters are 3 bytes each, although some are 4.

How do I tell which are 4-byte characters and which are 3-byte characters, please?

 

I guess it has something to do with UTF-8 encoding, but I don't know where to start.

Thanks for any help on this.

Paul

 

 


<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<pre>
<?php
$data = '0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe6 0x9d 0xa5 0xe6 0x8f 0x90 0x2e 0xe9 0x98 0xbf 0xe4 0xb8 0x8d 0xe9 0x83 0xbd 0xe7 0x83 0xad 0xe8 0xa5 0xbf 0xe6 0x8f 0x90';
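### Split the dump into individual hex codes.
$codes = preg_split('/\p{Z}+/', $data, -1, PREG_SPLIT_NO_EMPTY);
### Convert each code to a raw byte. intval() with base 16 accepts the "0x" prefix;
### "$code + 0" only works before PHP 7, which dropped hex numeric strings.
$result = '';
foreach ($codes as $code) {
	$result .= pack('C', intval($code, 16));
}
echo $result, '<hr>';
### Split before each letter. The "u" modifier makes PCRE read the string as UTF-8
### rather than byte by byte; the "." stays attached to the character before it.
print_r(preg_split('/(?=\p{L})/u', $result, -1, PREG_SPLIT_NO_EMPTY));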
?> 
</pre>
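
On the original question of telling 3-byte from 4-byte characters: in UTF-8 the first byte of each sequence encodes its length, so you can walk the bytes without a regex. Here is a rough sketch of that idea. The helper names `utf8SequenceLength()` and `splitUtf8()` are made up for illustration, and the code only inspects lead bytes; it does not check continuation bytes or reject overlong forms. Note that in the posted dump the odd byte out is `0x2e`, an ASCII period, so all ten characters there are 3 bytes each.

```php
<?php
// Hypothetical helper: byte length of a UTF-8 sequence, judged from its lead byte.
function utf8SequenceLength(int $leadByte): int
{
    if ($leadByte < 0x80) {
        return 1; // 0xxxxxxx: ASCII, e.g. 0x2e "."
    }
    if (($leadByte & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($leadByte & 0xF0) === 0xE0) {
        return 3; // 1110xxxx: most Chinese characters
    }
    if (($leadByte & 0xF8) === 0xF0) {
        return 4; // 11110xxx: characters beyond U+FFFF
    }
    // 10xxxxxx is a continuation byte; anything else is not valid UTF-8.
    throw new InvalidArgumentException('Not a valid UTF-8 lead byte');
}

// Hypothetical helper: split a UTF-8 byte string into one string per character.
function splitUtf8(string $bytes): array
{
    $chars = [];
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        $len = utf8SequenceLength(ord($bytes[$i]));
        $chars[] = substr($bytes, $i, $len);
        $i += $len;
    }
    return $chars;
}

// The 31 bytes from the question, written as one hex string.
$sample = hex2bin('e998bfe4b88de69da5e68f902ee998bfe4b88de983bde783ade8a5bfe68f90');
$parts = splitUtf8($sample);

// Show the byte length of each character found.
print_r(array_map('strlen', $parts));
```

If you only want the Chinese characters and not the punctuation, you could filter `$parts` by length, or on PHP 7.4+ with mbstring loaded, `mb_str_split($sample, 1, 'UTF-8')` should give the same per-character split.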
