
[SOLVED] UTF-8 = MB String?


neoform


Yes

 

So, are there any special steps I need to take when dealing with them? I've noticed that all the chars are 1 byte, with the exception of a few that are 2 bytes long. It's really messing with me, since I can't figure out how to tell the 2-byte chars from the 1-byte chars without dissecting the string.
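For reference, the width of a UTF-8 character can be read off its first byte, so "dissecting" only means looking at lead bytes. A minimal sketch, assuming well-formed UTF-8 input; `utf8_seq_len` is a hypothetical helper name, not a PHP built-in:

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the given lead byte, or 0 if the byte cannot start a sequence.
function utf8_seq_len(string $byte): int
{
    $h = ord($byte);
    if ($h <= 0x7F) { return 1; }               // 0xxxxxxx: plain ASCII
    if ($h >= 0xC2 && $h <= 0xDF) { return 2; } // 110xxxxx: 2-byte lead
    if ($h >= 0xE0 && $h <= 0xEF) { return 3; } // 1110xxxx: 3-byte lead
    if ($h >= 0xF0 && $h <= 0xF4) { return 4; } // 11110xxx: 4-byte lead
    return 0;                                   // continuation or invalid byte
}

$str = "ñandú"; // assumes this file is saved as UTF-8
for ($i = 0; $i < strlen($str); ) {
    $len = utf8_seq_len($str[$i]) ?: 1; // step over stray bytes one at a time
    echo substr($str, $i, $len), " is ", $len, " byte(s)\n";
    $i += $len;
}
```

This only inspects lead bytes; it does not validate the continuation bytes that follow.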


Either use the mb_* functions, or possibly decode the string first. My UTF-8 work with PHP has been minimal. What are you trying to do?
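As a rough illustration of the mb_* suggestion, here is how the byte-based and multibyte-aware functions differ on the same string. This is a sketch that assumes the mbstring extension is loaded and the source file is saved as UTF-8:

```php
<?php
$str = "Iñtërnâtiônàlizætiøn"; // assumes this file is saved as UTF-8

$bytes = strlen($str);             // counts bytes
$chars = mb_strlen($str, 'UTF-8'); // counts code points

// Byte-based substr() can cut a 2-byte char in half;
// mb_substr() works in code points, so it cannot.
$first_five = mb_substr($str, 0, 5, 'UTF-8');

echo "$bytes bytes, $chars chars, first five: $first_five\n";
```

Note that mb_strlen() counts code points, so a base letter followed by a separate combining mark still counts as two.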

 

I'm grabbing a UTF-8 feed and parsing its contents with mb_substr and mb_strlen, but the one that's messing things up (I think, not sure) is wordwrap(). It seems to count bytes per line rather than characters.


How about something like this? I think it will cover marks and combining characters as well.

 

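<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<?php
// Assumes this file is saved as UTF-8, so the literal is already UTF-8
// and needs no utf8_encode()/utf8_decode() round trip.
$mb_str = "Iñtërnâtiônàlizætiøn";
// The /u modifier makes PCRE treat pattern and subject as UTF-8;
// \X matches a base character together with any combining marks.
$lines = preg_split('/(\X{5})/u', $mb_str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
foreach ($lines as $line) {
    echo $line;
    echo '<br>';
}
?>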


No good, the chars still get mashed..

 

This is the best example of it going sour.

 

Each line should have exactly 10 chars, but since they're Unicode, each line ends up with fewer because of the double-byte chars.

 

echo utf8_decode(wordwrap(utf8_encode("Iñtërnâtiônàlizætiøn"), 10, '<br />', true));

 

I need to find a UTF-8 version of wordwrap, or write one (ugh) myself. :S
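To make the byte counting visible, the sketch below wraps the same string with the built-in wordwrap() and prints the byte length and character length of each resulting line. It assumes the mbstring extension and a UTF-8 source file, and uses "\n" as the break so the lines are easy to split:

```php
<?php
$str = "Iñtërnâtiônàlizætiøn"; // assumes this file is saved as UTF-8

// wordwrap() measures width in bytes, so every 2-byte char uses up two slots.
$lines = explode("\n", wordwrap($str, 10, "\n", true));

foreach ($lines as $line) {
    printf("%d bytes, %d chars\n", strlen($line), mb_strlen($line, 'UTF-8'));
}
```

With other inputs a byte-based cut can also land in the middle of a multibyte character, which leaves invalid UTF-8 on both sides of the break.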


Couldn't find a function anywhere so I wrote my own.

 

Here's a multibyte wordwrap:

 

function mb_wordwrap($str, $width = 70, $break = "\n", $cut = false)
{
	$return = '';
	$str_bytes = strlen($str);
	$first_char = true;

	$current_line = '';
	$current_line_char_count = 0;
	$current_word = '';
	$current_word_char_count = 0;

	for ($i=0; $i < $str_bytes; $i++)
	{
		//get the next char (unicode or ascii)
		//($str[$i] replaces the $str{$i} offset syntax, which PHP 8 no longer accepts)
		$char = $str[$i];
		$h = ord($char);
		if ($h <= 0x7F) 
		{ $char_code = $h; } 
		else if ($h < 0xC2) 
		{ $char_code = false; } 
		else if ($h <= 0xDF) 
		{ 
			//2-byte sequence
			$c2 = $str[++$i];
			$char .= $c2;
			$char_code = ($h & 0x1F) << 6 | (ord($c2) & 0x3F); 
		} 
		else if ($h <= 0xEF) 
		{ 
			//3-byte sequence
			$c2 = $str[++$i];
			$c3 = $str[++$i];
			$char .= $c2.$c3;
			$char_code = ($h & 0x0F) << 12 | (ord($c2) & 0x3F) << 6 | (ord($c3) & 0x3F); 
		} 
		else if ($h <= 0xF4) 
		{ 
			//4-byte sequence (the lead byte carries only 3 payload bits)
			$c2 = $str[++$i];
			$c3 = $str[++$i];
			$c4 = $str[++$i];
			$char .= $c2.$c3.$c4;
			$char_code = ($h & 0x07) << 18 | (ord($c2) & 0x3F) << 12 | (ord($c3) & 0x3F) << 6 | (ord($c4) & 0x3F); 
		} 
		else 
		{ 
			//unrecognized char, skip it
			continue; 
		}

		//if it's a space, new word commencing
		if ($char_code == 32)
		{
			//if line is too long, linebreak time!
			if ($current_line_char_count + $current_word_char_count >= $width) 
			{
				if ($current_line_char_count)
				{ $return .= $current_line.$break; }

				//reset the current line
				$current_line = $current_word;
				$current_line_char_count = $current_word_char_count;
			}
			else
			{
				//include a space at the front of the word if this isn't the first char
				//since we assume there was a space prior to this word except for the first word
				$current_line .= ($first_char ? '' : ' ').$current_word; 
				$current_line_char_count += $current_word_char_count + ($first_char ? 0 : 1);				
			}

			$current_word = '';
			$current_word_char_count = 0;

			$first_char = false;
		}
		//if it's a char, add it to the word
		else
		{ 
			if ($cut)
			{
				//check if this word is too long. if it is, slice it.
				if ($current_word_char_count >= $width)
				{
					//clear the current line and word to the return value
					if ($current_line_char_count)
					{ $return .= $current_line.$break; }

					$current_line = $current_word;
					$current_line_char_count = $current_word_char_count;

					$current_word = '';
					$current_word_char_count = 0;
				}
			}

			$current_word .= $char; 
			$current_word_char_count++;
		}
	}

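	//check for leftovers and add them to the string
	if ($current_word_char_count)
	{
		//no separator if the line is empty; otherwise break when the space plus the word would overflow the width
		if (!$current_line_char_count)
		{ $sep = ''; }
		else
		{ $sep = ($current_line_char_count + $current_word_char_count >= $width) ? $break : ' '; }

		$return .= $current_line.$sep.$current_word;
	}
	else
	{ $return .= $current_line; }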

	return $return;
}
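For comparison, a shorter variant is possible if the mbstring extension does the character counting instead of decoding bytes by hand. This is only a sketch under stated assumptions, not a drop-in replacement for the function above: `mb_wordwrap_sketch` is a hypothetical name, it splits on single spaces only, and it measures width in code points (so combining marks count separately).

```php
<?php
// Hypothetical alternative: greedy word wrap that measures width in
// code points via mbstring. Assumes UTF-8 input and $width >= 1.
function mb_wordwrap_sketch(string $str, int $width = 70, string $break = "\n", bool $cut = false): string
{
    if ($width < 1) {
        throw new InvalidArgumentException('width must be at least 1');
    }

    $lines = [];
    $line = '';
    foreach (explode(' ', $str) as $word) {
        // With $cut, slice over-long words into $width-sized pieces.
        while ($cut && mb_strlen($word, 'UTF-8') > $width) {
            if ($line !== '') {
                $lines[] = $line;
                $line = '';
            }
            $lines[] = mb_substr($word, 0, $width, 'UTF-8');
            $word = mb_substr($word, $width, null, 'UTF-8');
        }

        if ($line === '') {
            $line = $word;
        } elseif (mb_strlen($line, 'UTF-8') + 1 + mb_strlen($word, 'UTF-8') <= $width) {
            // The word fits after a single space.
            $line .= ' ' . $word;
        } else {
            // The word does not fit: close the line and start a new one.
            $lines[] = $line;
            $line = $word;
        }
    }
    $lines[] = $line;

    return implode($break, $lines);
}

echo mb_wordwrap_sketch("Iñtërnâtiônàlizætiøn", 10, "\n", true), "\n";
```

The trade-off is a hard dependency on mbstring, in exchange for not having to handle lead and continuation bytes manually.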
