eyrique, Posted October 3, 2013 (edited)

Hi,

I have written a function to count the total number of words in a string. For example:

    祝你 Happy Birthday - counted as 4 words
    祝你生日快樂 - counted as 6 words
    Happy Birthday 帥哥 2013 - counted as 5 words
    Happy Birthday 09/09/13 - counted as 3 words

The problem is that I tested it on localhost (AppServ on Windows 7) and it works perfectly. But when I upload it to the server, the Chinese words are counted wrongly. For example:

    祝你 Happy Birthday - becomes 3 words
    祝你生日快樂 - becomes 1 word
    Happy Birthday 帥哥 2013 - becomes 4 words

I'm not sure what's wrong with it. Can someone help with this? Here's my code:

    function count_total_word($txt)
    {
        // Count words: split on Unicode separators and punctuation
        $total = count(preg_split('~[\p{Z}\p{P}]+~u', $txt, -1, PREG_SPLIT_NO_EMPTY)) + 1;
        // Ignore "/" so that a date such as 09/09/13 counts as one word
        $total -= count(preg_split('~[/]+~u', $txt, -1, PREG_SPLIT_NO_EMPTY));
        return $total;
    }

Localhost PHP: 5.2.6
Server PHP: 5.2.17

Thanks

Edited October 3, 2013 by eyrique
Link to comment: https://forums.phpfreaks.com/topic/282674-counting-chinese-words/
.josh, Posted October 3, 2013

It sounds like PCRE on your server was not compiled with "--enable-unicode-properties".

Edit: and possibly also not with "--enable-utf8". Then again, you aren't getting errors thrown at you, so you probably already have that one enabled.

Check out this article: http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/
jazzman1, Posted October 3, 2013

Are you sure that the string "祝你生日快樂" contains 6 words? Are there letters in this language, as there are in European languages?

Instead of using PCRE (Perl Compatible Regular Expressions), consider the multibyte character encoding schemes and the multibyte string functions in PHP. Have a look at this example:

    $str = "祝你生日快樂";
    echo strlen($str);             // 18 (bytes in UTF-8)
    echo '<br />';
    echo mb_strlen($str, 'UTF-8'); // 6 (characters)

You should tell us which result is right and which is wrong.
eyrique (Author), Posted October 4, 2013

.josh said:
    It sounds like PCRE on your server was not compiled with "--enable-unicode-properties". Edit: and possibly also not with "--enable-utf8". Then again, you aren't getting errors thrown at you, so you probably already have that one enabled. Check out this article: http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/

I ran this check on my server:

    $ pcretest -C

and got the following output:

    PCRE version 6.6 06-Feb-2006
    Compiled with
      UTF-8 support
      Unicode properties support
      Newline character is LF
      Internal link size = 2
      POSIX malloc threshold = 10
      Default match limit = 10000000
      Default recursion depth limit = 10000000
      Match recursion uses stack

I suppose this means unicode-properties is enabled?
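Note that pcretest reports on the system-wide PCRE library, while PHP is often built against its own bundled copy, so the two may differ. One way to see what PHP itself uses is to probe the regex engine from inside a script. The following is only a sketch: it assumes PHP 5.2.4 or later (for the PCRE_VERSION constant), and the function name pcre_supports_unicode_properties is hypothetical. The test character 祝 is written as escaped UTF-8 bytes so the result does not depend on how the script file is saved.

```php
<?php
// Hypothetical helper: reports whether the PCRE library that PHP is
// actually linked against accepts Unicode property escapes in UTF-8 mode.
function pcre_supports_unicode_properties()
{
    // "\xE7\xA5\x9D" is the UTF-8 encoding of 祝 (U+795D).
    // preg_match() returns false if the pattern cannot be compiled;
    // @ suppresses the warning PHP emits in that case.
    return @preg_match('/^\p{Han}$/u', "\xE7\xA5\x9D") === 1;
}

echo 'PCRE used by PHP: ', PCRE_VERSION, "\n";
echo 'Unicode properties: ', pcre_supports_unicode_properties() ? 'yes' : 'no', "\n";
```

If the version printed here differs from the one pcretest shows, the pcretest output says little about how preg_split() behaves in the script.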
eyrique (Author), Posted October 4, 2013

jazzman1 said:
    Are you sure that the string "祝你生日快樂" contains 6 words? Are there letters in this language, as there are in European languages? Instead of using PCRE (Perl Compatible Regular Expressions), consider the multibyte character encoding schemes and the multibyte string functions in PHP. Have a look at this example:

        $str = "祝你生日快樂";
        echo strlen($str);             // 18 (bytes in UTF-8)
        echo '<br />';
        echo mb_strlen($str, 'UTF-8'); // 6 (characters)

    You should tell us which result is right and which is wrong.

Technically this string should be considered 6 characters rather than 6 words, but that is not how Chinese text is counted. I tried mb_strlen(), but it doesn't work for English words, because the function counts characters.

Here's what we are looking for:

    祝你生日快樂 - 6 words
    Happy Birthday - 2 words
jazzman1, Posted October 4, 2013 (edited)

So, if I understand you correctly, there are no letters in the Chinese language, right? Have you ever checked the multibyte string functions?

You can apply a rule: if the string contains English characters, use str_word_count(); if it contains Chinese characters, use mb_strlen().

    <?php
    $str_en = "Happy Birthday";
    $str_ch = '祝你生日快樂';
    echo mb_strlen($str_ch, 'UTF-8'); // 6 Chinese words (characters)
    echo str_word_count($str_en);     // 2 English words

Edited October 4, 2013 by jazzman1
eyrique (Author), Posted October 4, 2013

jazzman1 said:
    So, if I understand you correctly, there are no letters in the Chinese language, right? Have you ever checked the multibyte string functions?

There is

jazzman1 said:
    You can apply a rule: if the string contains English characters, use str_word_count(); if it contains Chinese characters, use mb_strlen().

        <?php
        $str_en = "Happy Birthday";
        $str_ch = '祝你生日快樂';
        echo mb_strlen($str_ch, 'UTF-8'); // 6 Chinese words (characters)
        echo str_word_count($str_en);     // 2 English words

The string might contain a combination of Chinese and English words, or just a single language. For example:

    祝你生日快樂 (6 words)
    祝你 Happy Birthday (4 words)
    Happy Birthday (2 words)
jazzman1, Posted October 4, 2013 (edited)

Then add up the counts of the Chinese and English words and display the total!

Edited October 4, 2013 by jazzman1
eyrique (Author), Posted October 4, 2013

jazzman1 said:
    Then add up the counts of the Chinese and English words and display the total!

How do you do that?
jazzman1, Posted October 4, 2013

Both functions return integers!

    <?php
    $str_en = "Happy Birthday";
    $str_ch = '祝你生日快樂';
    var_dump(mb_strlen($str_ch, 'UTF-8')); // int(6)
    var_dump(str_word_count($str_en));     // int(2)
    echo mb_strlen($str_ch, 'UTF-8') + str_word_count($str_en); // 8
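For a single string that mixes both languages, as in the examples earlier in the thread, the two counts can be taken from the same value instead of from two separate variables. This is a minimal sketch, not a definitive solution: it assumes UTF-8 input and a PCRE build with Unicode property support, and count_mixed_words is a hypothetical name. Treating each Han character as one word, and each whitespace-delimited run that starts at a letter or digit as one word, is an assumption based on the expected counts given in the thread.

```php
<?php
// Hypothetical helper: counts every Han character as one word and every
// remaining whitespace-delimited token (e.g. "Happy", "09/09/13") as one word.
function count_mixed_words($txt)
{
    // Number of Han ideographs in the string.
    $han = preg_match_all('/\p{Han}/u', $txt, $unused);

    // Blank out the Han characters so they act as separators
    // between any adjacent non-Han tokens.
    $rest = preg_replace('/\p{Han}/u', ' ', $txt);

    // A token starts at a letter or digit and runs to the next
    // whitespace or Unicode separator, so "09/09/13" stays one token
    // and a lone "-" is not counted.
    $other = preg_match_all('/[\p{L}\p{N}][^\s\p{Z}]*/u', $rest, $unused);

    // preg_match_all() returns false on error (e.g. invalid UTF-8);
    // the casts turn that into 0.
    return (int) $han + (int) $other;
}

echo count_mixed_words('祝你 Happy Birthday'), "\n";      // 4
echo count_mixed_words('祝你生日快樂'), "\n";              // 6
echo count_mixed_words('Happy Birthday 帥哥 2013'), "\n"; // 5
echo count_mixed_words('Happy Birthday 09/09/13'), "\n";  // 3
```

Because the function relies on \p{Han}, it depends on the same Unicode property support discussed above; on a build without it, the patterns fail to compile and the function returns 0.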
eyrique (Author), Posted October 4, 2013

Thanks~ got it