[SOLVED] UTF-8 = MB String?

neoform · September 24, 2007

Just wondering, do UTF-8 encoded strings count as multi-byte strings?

effigy · September 24, 2007

Yes

neoform · September 24, 2007

So, are there any special steps I need to take when dealing with them? I've noticed that all the chars are 1 byte, with the exception of a few that are 2 bytes long. It's really messing with me, since I can't figure out how to tell the 2-byte chars from the 1-byte ones without dissecting the string.
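For what it's worth, the first byte of each UTF-8 sequence already says how many bytes the character occupies, so no real dissection is needed. Below is a minimal sketch of that idea. The helper name utf8_seq_length is made up for illustration (it is not a PHP built-in), and the mb_strlen call assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the byte length of the
// UTF-8 sequence that starts with $byte, or 0 if $byte cannot start one
// (a continuation byte or an invalid lead byte).
function utf8_seq_length($byte)
{
	$h = ord($byte);
	if ($h <= 0x7F) { return 1; } // 0xxxxxxx: plain ASCII
	if ($h < 0xC2)  { return 0; } // continuation byte or overlong lead
	if ($h <= 0xDF) { return 2; } // 110xxxxx: 2-byte sequence
	if ($h <= 0xEF) { return 3; } // 1110xxxx: 3-byte sequence
	if ($h <= 0xF4) { return 4; } // 11110xxx: 4-byte sequence
	return 0;                     // 0xF5 and up never appear in UTF-8
}

// "Iñtërn" written as explicit UTF-8 bytes, so the file encoding doesn't matter
$str = "I\xC3\xB1t\xC3\xABrn";

$byte_count = strlen($str);             // counts bytes
$char_count = mb_strlen($str, 'UTF-8'); // counts characters (needs mbstring)

echo $byte_count, " bytes, ", $char_count, " chars\n";
```

The byte count and the character count differ by one for every 2-byte character in the string, which is why the plain string functions and the mb_* functions disagree.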

effigy · September 24, 2007

Either use the mb_* functions, or possibly decode the string first. My UTF-8 work with PHP has been minimal. What are you trying to do?

neoform · September 24, 2007

I'm grabbing a UTF-8 feed and parsing its contents with mb_substr and mb_strlen, but the one that's messing things up (I think, not sure) is wordwrap(). It doesn't seem to be counting the chars per line properly.

effigy · September 24, 2007

What if you decode, wordwrap, then encode?

neoform · September 24, 2007

Nah..

while this works:

echo utf8_decode(utf8_encode("Iñtërnâtiônàlizætiøn"));

this doesn't..

echo utf8_decode(wordwrap(utf8_encode("Iñtërnâtiônàlizætiøn")));

the unicode chars get fried by wordwrap..
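A likely explanation is that wordwrap() measures width in bytes, not characters. The sketch below wraps the same test string at a width of 10 and prints the byte and character count of each resulting line. It assumes the file is saved as UTF-8 and that the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is loaded.
$str = "Iñtërnâtiônàlizætiøn"; // 20 characters, 27 bytes in UTF-8

// wordwrap() counts bytes, so with $cut = true each full line is 10 bytes
$lines = explode("\n", wordwrap($str, 10, "\n", true));

foreach ($lines as $line) {
	// strlen() reports bytes, mb_strlen() reports characters
	printf("%d bytes, %d chars\n", strlen($line), mb_strlen($line, 'UTF-8'));
}
```

With this particular string the cuts happen to land between characters, so nothing is corrupted, but every line holds fewer than 10 characters. With other input a cut can fall in the middle of a 2-byte sequence and leave invalid UTF-8 behind.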

effigy · September 24, 2007

How about something like this? I think it will cover marks and combining characters as well.

<?php
$mb_str = utf8_decode(utf8_encode("Iñtërnâtiônàlizætiøn"));
$lines = preg_split('/(\X{5})/', $mb_str, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($lines as $line) {
	echo utf8_encode($line);
	echo '< br>';
}
?>

neoform · September 24, 2007

No good, the chars still get mashed..

this is the best example of it going sour.

each line should have exactly 10 chars.. but since they're unicode, it has fewer due to the double bytes..

echo utf8_decode(wordwrap(utf8_encode("Iñtërnâtiônàlizætiøn"), 10, '<br />', true));

I need to find a utf8 version of wordwrap, or write one (ugh) myself.. :S

effigy · September 24, 2007

The example I posted worked for me. Did you fix the br tag? I had to add a space so that the forum would not process it.

neoform · September 24, 2007

Maybe I'm missing something... but no. I need it to operate the same way as wordwrap does, but with MB strings......... :S

neoform · September 25, 2007

Couldn't find a function anywhere so I wrote my own.

Here's a multibyte wordwrap:

function mb_wordwrap($str, $width = 70, $break = "\n", $cut = false)
{
	$return = '';
	$str_bytes = strlen($str);
	$first_char = true;

	$current_line = '';
	$current_line_char_count = 0;
	$current_word = '';
	$current_word_char_count = 0;

	for ($i = 0; $i < $str_bytes; $i++)
	{
		//get the next char (unicode or ascii)
		//$str[$i] is used instead of $str{$i}, which was removed in PHP 8
		$char = $str[$i];
		$h = ord($char);
		if ($h <= 0x7F) 
		{ $char_code = $h; } 
		else if ($h < 0xC2) 
		{
			//stray continuation byte or overlong lead byte:
			//kept as-is and counted as one char
			$char_code = false;
		} 
		else if ($h <= 0xDF) 
		{ 
			//2 byte sequence
			$c2 = $str[++$i];
			$char .= $c2;
			$char_code = ($h & 0x1F) << 6 | (ord($c2) & 0x3F); 
		} 
		else if ($h <= 0xEF) 
		{ 
			//3 byte sequence (only $c2 and $c3 exist here, there is no $c4)
			$c2 = $str[++$i];
			$c3 = $str[++$i];
			$char .= $c2.$c3;
			$char_code = ($h & 0x0F) << 12 | (ord($c2) & 0x3F) << 6 | (ord($c3) & 0x3F); 
		} 
		else if ($h <= 0xF4) 
		{ 
			//4 byte sequence (the lead byte carries only 3 payload bits)
			$c2 = $str[++$i];
			$c3 = $str[++$i];
			$c4 = $str[++$i];
			$char .= $c2.$c3.$c4;
			$char_code = ($h & 0x07) << 18 | (ord($c2) & 0x3F) << 12 | (ord($c3) & 0x3F) << 6 | (ord($c4) & 0x3F); 
		} 
		else 
		{ 
			//unrecognized char, skip it
			continue; 
		}

		//if it's a space, new word commencing
		if ($char_code == 32)
		{
			//if line is too long, linebreak time!
			if ($current_line_char_count + $current_word_char_count >= $width) 
			{
				if ($current_line_char_count)
				{ $return .= $current_line.$break; }

				//reset the current line
				$current_line = $current_word;
				$current_line_char_count = $current_word_char_count;
			}
			else
			{
				//include a space at the front of the word if this isn't the first char
				//since we assume there was a space prior to this word except for the first word
				$current_line .= ($first_char ? '' : ' ').$current_word; 
				$current_line_char_count += $current_word_char_count + ($first_char ? 0 : 1);				
			}

			$current_word = '';
			$current_word_char_count = 0;

			$first_char = false;
		}
		//if it's a char, add it to the word
		else
		{ 
			if ($cut)
			{
				//check if this word is too long. if it is, slice it.
				if ($current_word_char_count >= $width)
				{
					//clear the current line and word to the return value
					if ($current_line_char_count)
					{ $return .= $current_line.$break; }

					$current_line = $current_word;
					$current_line_char_count = $current_word_char_count;

					$current_word = '';
					$current_word_char_count = 0;
				}
			}

			$current_word .= $char; 
			$current_word_char_count++;
		}
	}

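	//check for leftovers and add them to the string
	if ($current_word_char_count)
	{
		//no pending line: emit the word alone, without a leading space
		if (!$current_line_char_count)
		{ $return .= $current_word; }
		//otherwise join with a space if it fits, or with $break if it doesn't
		else
		{ $return .= $current_line.($current_line_char_count + 1 + $current_word_char_count > $width ? $break : ' ').$current_word; }
	}
	else
	{
		//string ended on a space: the pending line still has to be flushed
		$return .= $current_line;
	}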

	return $return;
}

effigy · September 25, 2007

I still don't understand why mine didn't work. Does your function cover marks and combining characters?
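To make the question concrete: a base letter followed by a combining mark is two code points but displays as a single character. The sketch below compares a code point count with a count of \X matches (extended grapheme clusters). It assumes mbstring and a PCRE build with UTF-8 support, which the /u modifier needs; the variable names are only for illustration.

```php
<?php
// Assumes mbstring and PCRE UTF-8 support (the /u modifier).
// "e" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes
$str = "e\xCC\x81";

$code_points = mb_strlen($str, 'UTF-8');          // counts code points
$graphemes   = preg_match_all('/\X/u', $str, $m); // counts \X matches

echo $code_points, " code points, ", $graphemes, " grapheme(s)\n";
```

A wrapper that counts code points would treat this as two characters wide, while one built on \X would treat it as one.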

neoform · September 25, 2007

try it out, it handles up to 4-byte unicode chars.

yours wrapped the text, but it didn't follow the rules set in wordwrap, which is to not cut words unnecessarily. this only breaks a word if it has to.
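For comparison, here is a shorter sketch of the same rule (only break a word when it cannot fit on a line by itself), leaning on mb_strlen and mb_substr instead of decoding bytes by hand. It is not the function posted above: the name mb_wordwrap_sketch is made up, it assumes the mbstring extension, it splits on single spaces only, and it counts code points, so combining marks are still counted separately.

```php
<?php
// Hypothetical sketch, not the function from the thread. Assumes mbstring.
// Splits on single spaces only and counts code points, not graphemes.
function mb_wordwrap_sketch($str, $width = 70, $break = "\n", $cut = false)
{
	$lines = array();
	$line = '';

	foreach (explode(' ', $str) as $word) {
		// With $cut, slice an over-long word into $width-character pieces
		while ($cut && mb_strlen($word, 'UTF-8') > $width) {
			if ($line !== '') {
				$lines[] = $line; // flush the pending line first
				$line = '';
			}
			$lines[] = mb_substr($word, 0, $width, 'UTF-8');
			$word = mb_substr($word, $width, null, 'UTF-8');
		}

		if ($line === '') {
			// first word on the line always goes in
			$line = $word;
		} elseif (mb_strlen($line, 'UTF-8') + 1 + mb_strlen($word, 'UTF-8') <= $width) {
			// word plus its separating space still fits
			$line .= ' ' . $word;
		} else {
			// word doesn't fit: close the line and start a new one
			$lines[] = $line;
			$line = $word;
		}
	}

	if ($line !== '') {
		$lines[] = $line;
	}

	return implode($break, $lines);
}

// Usage, assuming this file is saved as UTF-8:
echo mb_wordwrap_sketch("Iñtërnâtiônàlizætiøn", 10, "<br />", true);
```

Collecting lines in an array and joining them at the end avoids the separate leftover handling, at the cost of holding all lines in memory.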
