Get Position Of First Unicode Character

silkfire · October 22, 2012

I need a way to determine the position of the first unicode character in a string.

For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return the number 8.

None of the mb_ functions I found were suitable for this purpose. Hope someone can help me with this one. I also need the function to be efficient, since it will be used in a loop, so nothing too elaborate.

requinix · October 22, 2012

If you mean to say "the offset of the first character which has a multi-byte UTF-8 encoding" then find the first byte whose value is >= 192 (0xC0).

silkfire · October 22, 2012

Mmkay any way to do it with regex? Or what is your suggestion?

ManiacDan · October 22, 2012

If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate.

If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (something like \u123), then put a range [a-z] in your code where A is the first character you want to detect and Z is the maximum possible value.

Or just use string functions.
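
For reference, the regex route mentioned above might look roughly like this in PHP. This is only a sketch: the function name firstNonAsciiOffset is made up for illustration, and it assumes the input is valid UTF-8. Note that PHP's PCRE writes code points as \x{...} (with the u modifier) rather than \u...; here a plain byte class is enough, because in UTF-8 any byte outside 0x00-0x7F belongs to a multi-byte sequence.

```php
<?php
// Hypothetical helper (name not from the thread).
// Returns the byte offset of the first non-ASCII byte, or false if there is none.
// Assumes $s is valid UTF-8.
function firstNonAsciiOffset(string $s)
{
    // Without the u modifier PCRE matches byte by byte, so this class
    // matches the first byte in the range 0x80-0xFF.
    if (preg_match('/[^\x00-\x7F]/', $s, $m, PREG_OFFSET_CAPTURE) === 1) {
        return $m[0][1]; // PREG_OFFSET_CAPTURE stores [matched text, byte offset]
    }
    return false;
}
```

For '89423io3önska032j' this should give 8. Everything before the first match is single-byte, so the byte offset and the character offset are the same number.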

salathe · October 22, 2012

I need a way to determine the position of the first unicode character in a string.

Basic Latin characters (for example, A-Z, 0-9) don't count? Your example hints that they don't, but they're unicode characters too.

For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return number 8.

Ahh, so it looks like you want the offset of the first character which isn't in the "C0 Controls and Basic Latin" table or perhaps some smaller set: just A-Z, a-z, 0-9?

None of the mb_ functions I found were suitable for this purpose.

Which ones did you "find"? I am certain this task could be readily resolved with at least one of the available mb_* functions. Or with PCRE regex (preg_* functions), or regular string inspection and manipulation. What have you tried so far?

silkfire · October 22, 2012

If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate.

If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (something like \u123), then put a range [a-z] in your code where A is the first character you want to detect and Z is the maximum possible value.

Or just use string functions.

UTF-8 is variable-width man, take a look here: http://en.wikipedia.org/wiki/UTF-8

But yeah, the basic question is, how do I find the position of the first multi-byte character?

I've tried a custom mb_explode function and then looping through the resulting array, but that's not very efficient.

requinix · October 22, 2012

Lemme say it a different way:

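function strposFirstMbChar($string) {
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        // Lead bytes of multi-byte UTF-8 sequences are >= 0xC0 (192).
        if ($string[$i] >= "\xC0") {
            return $i; // byte offset of the first multi-byte character
        }
    }
    return false; // no multi-byte character found
}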
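
If the character itself is needed as well as its position, the same kind of loop can be extended a little. A sketch, assuming valid UTF-8 input; the name firstMbCharInfo and the shape of the returned array are hypothetical, not from the thread. The lead byte tells how many bytes the sequence occupies: 0xC0-0xDF means 2, 0xE0-0xEF means 3, 0xF0 and up means 4.

```php
<?php
// Hypothetical helper (not from the thread). Assumes $s is valid UTF-8.
// Returns ['offset' => byte offset, 'char' => the multi-byte character],
// or null if the string contains no multi-byte character.
function firstMbCharInfo(string $s): ?array
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($s[$i]);
        if ($byte >= 0xC0) {
            // Derive the sequence width from the lead byte.
            $width = $byte >= 0xF0 ? 4 : ($byte >= 0xE0 ? 3 : 2);
            return ['offset' => $i, 'char' => substr($s, $i, $width)];
        }
    }
    return null;
}

$info = firstMbCharInfo('89423io3önska032j'); // source file assumed to be saved as UTF-8
if ($info !== null) {
    echo $info['offset'], ' ', $info['char'], "\n";
}
```

The offset is a byte offset, but since every character before it is single-byte, it is also the character offset.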
