neoform Posted September 24, 2007
Just wondering, do UTF-8 encoded strings classify as multi-byte strings?
effigy Posted September 24, 2007
Yes.
neoform Posted September 24, 2007
So.. are there any special steps I need to take when dealing with them? I've noticed all the chars are 1 byte, with the exception of a few which are 2 bytes long. It's really messing with me, since I can't figure out how to tell the 2-byte chars from the 1-byte chars without dissecting the string..
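One way to tell 1-byte from 2-byte characters without decoding the whole string is to look at the first byte of each sequence: in UTF-8 the lead byte alone determines how many bytes the character occupies. A minimal sketch, assuming well-formed input; utf8_sequence_length() is a hypothetical helper name, not a PHP built-in:

```php
<?php
// Hypothetical helper (not a PHP built-in): number of bytes in the UTF-8
// sequence introduced by the lead byte $byte, or 0 if $byte cannot start
// a sequence (a continuation byte or an invalid lead byte).
function utf8_sequence_length(string $byte): int
{
    $h = ord($byte);
    if ($h <= 0x7F) {
        return 1; // 0xxxxxxx: plain ASCII
    }
    if ($h >= 0xC2 && $h <= 0xDF) {
        return 2; // 110xxxxx
    }
    if ($h >= 0xE0 && $h <= 0xEF) {
        return 3; // 1110xxxx
    }
    if ($h >= 0xF0 && $h <= 0xF4) {
        return 4; // 11110xxx
    }
    return 0;
}

// "Iñtërn" written with escapes so the file's own encoding does not matter.
$str = "I\u{00F1}t\u{00EB}rn";
$lengths = [];
for ($i = 0, $n = strlen($str); $i < $n; $i += max(1, $len)) {
    $len = utf8_sequence_length($str[$i]);
    $lengths[] = $len;
}
echo implode(',', $lengths), "\n"; // prints 1,2,1,2,1,1
```

The loop advances by the reported length, so it visits one character per iteration instead of one byte.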
effigy Posted September 24, 2007
Either use the mb_* functions, or possibly decode the string first. My UTF-8 work with PHP has been minimal. What are you trying to do?
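The difference between the byte-oriented string functions and their mb_* counterparts can be sketched as follows, assuming the mbstring extension is loaded and the input is UTF-8. The variable names are illustrative only:

```php
<?php
// "Iñtërnâtiônàlizætiøn" written with escapes so the file's own encoding
// does not matter: 20 characters, 7 of which take 2 bytes in UTF-8.
$str = "I\u{00F1}t\u{00EB}rn\u{00E2}ti\u{00F4}n\u{00E0}liz\u{00E6}ti\u{00F8}n";

$bytes = strlen($str);             // counts bytes
$chars = mb_strlen($str, 'UTF-8'); // counts characters (code points)

$byte_slice = substr($str, 0, 2);             // "I" plus half of "ñ"
$char_slice = mb_substr($str, 0, 2, 'UTF-8'); // "Iñ"

echo "$bytes bytes, $chars chars\n"; // prints 27 bytes, 20 chars
var_dump(mb_check_encoding($byte_slice, 'UTF-8')); // bool(false): broken sequence
var_dump(mb_check_encoding($char_slice, 'UTF-8')); // bool(true)
```

Passing the encoding explicitly avoids depending on the configured internal encoding.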
neoform Posted September 24, 2007
I'm grabbing a UTF-8 feed and parsing its contents with mb_substr and mb_strlen, but the one that's messing things up (I think, not sure) is wordwrap().. It doesn't seem to be counting the characters per line properly..
effigy Posted September 24, 2007
What if you decode, wordwrap, then encode?
neoform Posted September 24, 2007
Nah.. while this works:

    echo utf8_decode(utf8_encode("Iñtërnâtiônàlizætiøn"));

this doesn't..

    echo utf8_decode(wordwrap(utf8_encode("Iñtërnâtiônàlizætiøn")));

the unicode chars get fried by wordwrap..
effigy Posted September 24, 2007
How about something like this? I think it will cover marks and combining characters as well.

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <?php
    $mb_str = utf8_decode(utf8_encode("Iñtërnâtiônàlizætiøn"));
    $lines = preg_split('/(\X{5})/', $mb_str, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($lines as $line) {
        echo utf8_encode($line);
        echo '< br>';
    }
    ?>
neoform Posted September 24, 2007
No good, the chars still get mashed.. This is the best example of it going sour. Each line should have exactly 10 chars, but since they're unicode, each line has fewer because of the double-byte chars..

    echo utf8_decode(wordwrap(utf8_encode("Iñtërnâtiônàlizætiøn"), 10, '<br />', true));

I need to find a UTF-8 version of wordwrap, or write one (ugh) myself.. :S
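The shortfall comes from wordwrap() measuring width in bytes. A short sketch that makes this visible by printing both counts for every wrapped line, assuming the mbstring extension is available for the character count:

```php
<?php
// "Iñtërnâtiônàlizætiøn" as UTF-8, written with escapes.
$str = "I\u{00F1}t\u{00EB}rn\u{00E2}ti\u{00F4}n\u{00E0}liz\u{00E6}ti\u{00F8}n";

// Wrap at width 10 and force long words to be cut, as in the post above.
$lines = explode("\n", wordwrap($str, 10, "\n", true));

foreach ($lines as $line) {
    // strlen() is what wordwrap() effectively counts; mb_strlen() is what
    // a reader sees on screen.
    printf("%2d bytes, %2d chars\n", strlen($line), mb_strlen($line, 'UTF-8'));
}
```

No line exceeds 10 bytes, so any line containing 2-byte characters ends up shorter than 10 characters. With other input the cut can also land in the middle of a multi-byte sequence, which is what produces mangled characters.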
effigy Posted September 24, 2007
The example I posted worked for me. Did you fix the br tag? I had to add a space so that the forum would not process it.
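The \X{5} pattern can also run on the UTF-8 string directly if the pattern carries the u modifier, which avoids the utf8_decode()/utf8_encode() round trip. A sketch under that assumption; PREG_SPLIT_NO_EMPTY is added here to drop the empty pieces between consecutive matches:

```php
<?php
// "Iñtërnâtiônàlizætiøn" as UTF-8, written with escapes.
$str = "I\u{00F1}t\u{00EB}rn\u{00E2}ti\u{00F4}n\u{00E0}liz\u{00E6}ti\u{00F8}n";

// With /u, \X matches one extended grapheme cluster, so \X{5} is a group
// of five user-visible characters regardless of their byte length.
$chunks = preg_split(
    '/(\X{5})/u',
    $str,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

echo implode("\n", $chunks), "\n";
```

This slices every five characters and pays no attention to spaces, so it behaves like a fixed-width chunker rather than like wordwrap().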
neoform Posted September 24, 2007
Maybe I'm missing something... but no. I need it to operate the same way as wordwrap does, but with MB strings... :S
neoform Posted September 25, 2007
Couldn't find a function anywhere so I wrote my own. Here's a multibyte wordwrap:

    function mb_wordwrap($str, $width = 70, $break = "\n", $cut = false) {
        $return = '';
        $str_bytes = strlen($str);
        $first_char = true;
        $current_line = '';
        $current_line_char_count = 0;
        $current_word = '';
        $current_word_char_count = 0;

        //assumes well-formed UTF-8: a sequence truncated at the end of the
        //string would read past the last byte
        for ($i = 0; $i < $str_bytes; $i++) {
            //get the next char (unicode or ascii)
            $char = $str[$i];
            $h = ord($char);
            if ($h <= 0x7F) {
                $char_code = $h;
            } else if ($h < 0xC2) {
                //stray continuation byte or invalid lead byte, kept as-is
                $char_code = false;
            } else if ($h <= 0xDF) {
                //2 byte sequence
                $c2 = $str[++$i];
                $char .= $c2;
                $char_code = ($h & 0x1F) << 6 | (ord($c2) & 0x3F);
            } else if ($h <= 0xEF) {
                //3 byte sequence
                $c2 = $str[++$i];
                $c3 = $str[++$i];
                $char .= $c2.$c3;
                $char_code = ($h & 0x0F) << 12 | (ord($c2) & 0x3F) << 6 | (ord($c3) & 0x3F);
            } else if ($h <= 0xF4) {
                //4 byte sequence
                $c2 = $str[++$i];
                $c3 = $str[++$i];
                $c4 = $str[++$i];
                $char .= $c2.$c3.$c4;
                $char_code = ($h & 0x07) << 18 | (ord($c2) & 0x3F) << 12 | (ord($c3) & 0x3F) << 6 | (ord($c4) & 0x3F);
            } else {
                //unrecognized char, skip it
                continue;
            }

            //if it's a space, new word commencing
            if ($char_code === 32) {
                //if line is too long, linebreak time!
                if ($current_line_char_count + $current_word_char_count >= $width) {
                    if ($current_line_char_count) {
                        $return .= $current_line.$break;
                    }
                    //reset the current line
                    $current_line = $current_word;
                    $current_line_char_count = $current_word_char_count;
                } else {
                    //include a space at the front of the word if this isn't the first char
                    //since we assume there was a space prior to this word except for the first word
                    $current_line .= ($first_char ? '' : ' ').$current_word;
                    $current_line_char_count += $current_word_char_count + ($first_char ? 0 : 1);
                }
                $current_word = '';
                $current_word_char_count = 0;
                $first_char = false;
            }
            //if it's a char, add it to the word
            else {
                if ($cut) {
                    //check if this word is too long. if it is, slice it.
                    if ($current_word_char_count >= $width) {
                        //clear the current line and word to the return value
                        if ($current_line_char_count) {
                            $return .= $current_line.$break;
                        }
                        $current_line = $current_word;
                        $current_line_char_count = $current_word_char_count;
                        $current_word = '';
                        $current_word_char_count = 0;
                    }
                }
                $current_word .= $char;
                $current_word_char_count++;
            }
        }

        //check for leftovers and add them to the string
        if (!$current_word_char_count) {
            //the string ended on a space: only the pending line remains
            $return .= $current_line;
        } else if (!$current_line_char_count) {
            $return .= $current_word;
        } else {
            //same fit test as above: the joining space has to fit too
            $fits = $current_line_char_count + $current_word_char_count < $width;
            $return .= $current_line.($fits ? ' ' : $break).$current_word;
        }
        return $return;
    }
effigy Posted September 25, 2007
I still don't understand why mine didn't work. Does your function cover marks and combining characters?
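Whether a wrapper covers combining characters depends on what it counts. A function that counts code points treats a base letter plus a combining mark as two characters, while PCRE's \X (with the u modifier) treats the pair as one. A small sketch of the difference; count_graphemes() is a hypothetical helper, and mb_strlen() assumes the mbstring extension:

```php
<?php
// Hypothetical helper: number of extended grapheme clusters in a UTF-8
// string, counted with PCRE's \X escape.
function count_graphemes(string $s): int
{
    return (int) preg_match_all('/\X/u', $s);
}

$composed   = "\u{00E9}";  // "é" as a single code point
$decomposed = "e\u{0301}"; // "e" followed by COMBINING ACUTE ACCENT

foreach ([$composed, $decomposed] as $s) {
    printf(
        "%d bytes, %d code points, %d grapheme(s)\n",
        strlen($s),
        mb_strlen($s, 'UTF-8'),
        count_graphemes($s)
    );
}
// prints:
// 2 bytes, 1 code points, 1 grapheme(s)
// 3 bytes, 2 code points, 1 grapheme(s)
```

Both strings render as the same single character, so a width measured in code points overcounts the decomposed form by one.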
neoform Posted September 25, 2007
Try it out, it handles up to 4-byte unicode chars. Yours wrapped the text, but it didn't follow the rule wordwrap sets, which is to not cut words unnecessarily. Mine only breaks a word if it has to.
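The "only break a word if it has to" rule can also be expressed on top of the mbstring functions instead of decoding bytes by hand. The following is a rough sketch of that alternative, not neoform's function: mb_wordwrap_sketch() is a hypothetical name, it splits on single spaces only, ignores newlines already present in the text, and counts code points rather than grapheme clusters.

```php
<?php
// Hypothetical sketch: word wrap for UTF-8 text using mb_strlen()/mb_substr().
function mb_wordwrap_sketch(string $str, int $width = 70, string $break = "\n", bool $cut = false): string
{
    $width = max(1, $width); // guard against a zero or negative width
    $lines = [];
    $line = '';

    foreach (explode(' ', $str) as $word) {
        // With $cut, slice full-width pieces off a word that cannot fit
        // on any line. Each piece gets a line of its own.
        while ($cut && mb_strlen($word, 'UTF-8') > $width) {
            if ($line !== '') {
                $lines[] = $line;
                $line = '';
            }
            $lines[] = mb_substr($word, 0, $width, 'UTF-8');
            $word = mb_substr($word, $width, null, 'UTF-8');
        }

        if ($line === '') {
            // Start a new line with this word.
            $line = $word;
        } elseif (mb_strlen($line, 'UTF-8') + 1 + mb_strlen($word, 'UTF-8') <= $width) {
            // The word and its joining space still fit on the current line.
            $line .= ' ' . $word;
        } else {
            // The word does not fit: close the current line, start another.
            $lines[] = $line;
            $line = $word;
        }
    }

    $lines[] = $line;
    return implode($break, $lines);
}

// The example from the thread: two lines of exactly 10 characters each.
$str = "I\u{00F1}t\u{00EB}rn\u{00E2}ti\u{00F4}n\u{00E0}liz\u{00E6}ti\u{00F8}n";
echo mb_wordwrap_sketch($str, 10, "\n", true), "\n";
```

Working word by word keeps the logic short, at the cost of calling mb_strlen() on the current line repeatedly. For long inputs, tracking the line length in a counter, as the byte-level function does, would be cheaper.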