Counting Chinese Words


10 replies to this topic

#1 eyrique


Posted 03 October 2013 - 03:40 AM

Hi,
 
I wrote a function to count the total number of words in a string. For example:

祝你 Happy Birthday - counted as 4 words
祝你生日快樂 - counted as 6 words
Happy Birthday 帥哥 2013 - counted as 5 words
Happy Birthday 09/09/13 - counted as 3 words

The problem is that I tested it on localhost (AppServ on Windows 7) and it works perfectly. But when I upload it to the server, the Chinese words are counted wrongly, for example:

祝你 Happy Birthday - counted as 3 words
祝你生日快樂 - counted as 1 word
Happy Birthday 帥哥 2013 - counted as 4 words

I'm not sure what's wrong with it.

Can someone help with this?
 
Here's my code:
 

function count_total_word($txt) {
    $total = count(preg_split('~[\p{Z}\p{P}]+~u', $txt, null, PREG_SPLIT_NO_EMPTY)) + 1; // Count words
    $total -= count(preg_split('~[/]+~u', $txt, null, PREG_SPLIT_NO_EMPTY)); // Ignore "/"
    return $total;
}

 
Localhost PHP: 5.2.6

Server PHP: 5.2.17
 
Thanks


Edited by eyrique, 03 October 2013 - 03:51 AM.


#2 .josh

.josh

    .josh

  • Staff Alumni
  • 14,830 posts

Posted 03 October 2013 - 10:15 AM

It sounds like PCRE on your server was not compiled with "--enable-unicode-properties".

Edit: and possibly also "--enable-utf8". Then again, you aren't getting any errors thrown at you, so you probably already have that one enabled.

Check out this article: http://chrisjean.com...h-php-and-pcre/


#3 jazzman1


Posted 03 October 2013 - 03:10 PM

Are you sure that the string "祝你生日快樂" contains 6 words? Does this language have letters, as European languages do?

Instead of using PCRE (Perl Compatible Regular Expressions), consider using a multibyte character encoding and PHP's multibyte string functions.

Have a look at this example:

$str = "祝你生日快樂";
echo strlen($str); // 18 (bytes)
echo '<br />';
echo mb_strlen($str, 'UTF-8'); // 6 (characters)

You should tell us which result is right and which is wrong :)



#4 eyrique


Posted 03 October 2013 - 07:48 PM


I checked on my server:

$ pcretest -C

and got the following output:
 
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
 
I suppose this means Unicode properties support is enabled?
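
Note that pcretest reports on the system-wide PCRE library, while PHP builds of this era often use their own bundled copy of PCRE, so the two may differ. Below is a minimal sketch for checking from inside PHP instead. The function name pcre_unicode_support() is made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): reports what the PCRE build
// that PHP itself uses is able to handle.
function pcre_unicode_support()
{
    // preg_match() returns false when a pattern fails to compile;
    // "@" hides the warning PHP would otherwise print in that case.
    $utf8  = @preg_match('~^.$~u', '祝');        // needs UTF-8 support
    $props = @preg_match('~^\p{Han}$~u', '祝');  // also needs Unicode properties

    return array(
        'version'    => PCRE_VERSION, // built-in constant, PHP >= 5.2.4
        'utf8'       => $utf8 === 1,
        'properties' => $props === 1,
    );
}

print_r(pcre_unicode_support());
```

If 'properties' comes back false on the server but true on localhost, that would point to the PCRE build; if both are true, the difference probably lies elsewhere.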


#5 eyrique


Posted 03 October 2013 - 07:51 PM


Technically this string should be considered 6 characters rather than 6 words, but that's not how the Chinese count it.

I tried mb_strlen(), but that doesn't work for English words, as this function counts characters.

 

Here's what we are looking for:

祝你生日快樂 - 6 words
Happy Birthday - 2 words



#6 jazzman1


Posted 03 October 2013 - 09:02 PM

So, if I understand you correctly, there are no letters in the Chinese language, right?

Have you checked these multibyte string functions?

You can apply a rule: if the string contains English characters, use str_word_count(); if they are Chinese, use mb_strlen().

<?php

$str_en = "Happy Birthday";

$str_ch = '祝你生日快樂';

echo mb_strlen($str_ch, 'UTF-8'); // 6 Chinese words (characters)

echo str_word_count($str_en); // 2 English words

Edited by jazzman1, 03 October 2013 - 09:07 PM.


#7 eyrique


Posted 03 October 2013 - 09:07 PM

 


The string might contain a combination of Chinese and English words, or just a single language.

E.g.:

祝你生日快樂 (6 words)

祝你 Happy Birthday (4 words)

Happy Birthday (2 words)



#8 jazzman1


Posted 03 October 2013 - 09:12 PM

Then add up the counts of the Chinese and English words and display the result!


Edited by jazzman1, 03 October 2013 - 09:14 PM.


#9 eyrique


Posted 03 October 2013 - 09:17 PM


How do you do that?



#10 jazzman1


Posted 03 October 2013 - 09:27 PM

Both functions return integers, so you can simply add them!

<?php

$str_en = "Happy Birthday";

$str_ch = '祝你生日快樂';

var_dump(mb_strlen($str_ch, 'UTF-8')); // int(6)

var_dump(str_word_count($str_en)); // int(2)

echo mb_strlen($str_ch, 'UTF-8') + str_word_count($str_en); // 8
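
The addition above works when the Chinese and English parts are already in separate variables. For a single mixed string such as "祝你 Happy Birthday", mb_strlen() would also count the Latin letters and spaces, so the two parts need to be separated first. Note also that the pattern in the original function splits only on separators and punctuation, so an unbroken run of Chinese characters stays in one piece. Here is a rough sketch under these assumptions: each Han character counts as one word, every other run of non-separator characters counts as one word, "/" stays inside a token (so 09/09/13 is one word), the file is saved as UTF-8, and PCRE has Unicode properties support. The name count_mixed_words() is made up for illustration.

```php
<?php
// Hypothetical helper: counts words in a string that may mix Chinese and English.
function count_mixed_words($txt)
{
    // Every Han (Chinese) character is counted as one word.
    $han = preg_match_all('~\p{Han}~u', $txt, $m);
    if ($han === false) {
        return false; // invalid UTF-8 input or missing PCRE Unicode support
    }

    // Blank out the Han characters so only the other tokens remain.
    $rest = preg_replace('~\p{Han}~u', ' ', $txt);

    // Split on whitespace, separators and punctuation, except "/",
    // which is kept inside a token so a date stays one word.
    $words = preg_split('~(?:(?!/)[\s\p{Z}\p{P}])+~u', $rest, -1, PREG_SPLIT_NO_EMPTY);

    return $han + count($words);
}

echo count_mixed_words('祝你 Happy Birthday'), "\n";      // 4
echo count_mixed_words('祝你生日快樂'), "\n";              // 6
echo count_mixed_words('Happy Birthday 帥哥 2013'), "\n"; // 5
echo count_mixed_words('Happy Birthday 09/09/13'), "\n";  // 3
```

Matching Han characters explicitly, rather than relying on how the splitter treats them, makes the Chinese count independent of spacing in the input.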


#11 eyrique


Posted 03 October 2013 - 10:05 PM

Thanks~ got it :)





