silkfire, posted October 22, 2012:
I need a way to determine the position of the first unicode character in a string. For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return number 8. None of the mb_ functions I found were suitable for this purpose. Hope someone can help me on this one. I also need the function to be efficient, so no advanced solutions, as this will be used in a loop.
Thread: https://forums.phpfreaks.com/topic/269764-get-position-of-first-unicode-character/

requinix, posted October 22, 2012:
If you mean to say "the offset of the first character which has a multi-byte UTF-8 encoding" then find the first byte whose value is >= 192 (0xC0).

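
To illustrate why 192 is the threshold: in UTF-8, single-byte (ASCII) characters use byte values below 128, continuation bytes fall in 128 to 191, and the first byte of every multi-byte sequence is 192 or higher. The sketch below is only an illustration under that assumption; `utf8ByteRole` is a hypothetical helper name, not something from the thread, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): classify one byte by its UTF-8 role.
function utf8ByteRole(string $byte): string
{
    $value = ord($byte);
    if ($value < 0x80) {
        return 'ascii';        // 0xxxxxxx: a complete single-byte character
    }
    if ($value < 0xC0) {
        return 'continuation'; // 10xxxxxx: second or later byte of a sequence
    }
    return 'lead';             // 11xxxxxx: first byte of a multi-byte sequence
}

$o = "\xC3\xB6"; // "ö" encoded in UTF-8 (two bytes)
echo utf8ByteRole($o[0]), "\n"; // lead (0xC3 = 195)
echo utf8ByteRole($o[1]), "\n"; // continuation (0xB6 = 182)
echo utf8ByteRole('8'), "\n";   // ascii
```

Some values at or above 192 (0xC0, 0xC1, 0xF5 to 0xFF) never occur in valid UTF-8, so the check only says "this byte is in the lead-byte range"; it does not validate the string.
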
silkfire (author), posted October 22, 2012:
Mmkay, any way to do it with regex? Or what is your suggestion?

ManiacDan, posted October 22, 2012:
If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate. If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (something like \u123), then put a range [a-z] in your code where A is the first character you want to detect and Z is the maximum possible value. Or just use string functions.

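
A minimal sketch of the regex route, assuming the input is valid UTF-8. PHP's PCRE does not accept `\u123`; bytes are written as `\xHH`, and without the `/u` modifier the pattern is matched byte by byte, so a byte range works directly. `firstMultibyteOffsetRegex` is a hypothetical name chosen for this example.

```php
<?php
// Hypothetical helper (not from the thread): regex-based search for the
// first byte in the UTF-8 lead-byte range.
function firstMultibyteOffsetRegex(string $string)
{
    // No /u modifier, so [\xC0-\xFF] matches a single byte >= 0xC0.
    // PREG_OFFSET_CAPTURE makes each match an array of [text, byte offset].
    if (preg_match('/[\xC0-\xFF]/', $string, $match, PREG_OFFSET_CAPTURE) === 1) {
        return $match[0][1];
    }
    return false; // only single-byte characters (or an empty string)
}

// The thread's example, with "ö" written as its UTF-8 bytes.
var_dump(firstMultibyteOffsetRegex("89423io3\xC3\xB6nska032j")); // int(8)
var_dump(firstMultibyteOffsetRegex("abc123"));                   // bool(false)
```

The returned value is a byte offset. Because every character before the first multi-byte one occupies exactly one byte, it is also the character offset in this case.
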
salathe, posted October 22, 2012:
"I need a way to determine the position of the first unicode character in a string."
Basic Latin characters (for example, A-Z, 0-9) don't count? Your example hints that they don't, but they're unicode characters too.
"For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return number 8."
Ahh, so it looks like you want the offset of the first character which isn't in the "C0 Controls and Basic Latin" table or perhaps some smaller set: just A-Z, a-z, 0-9?
"None of the mb_ functions I found were suitable for this purpose."
Which ones did you "find"? I am certain this task could be readily resolved with at least one of the available mb_* functions. Or with PCRE regex (preg_* functions), or regular string inspection and manipulation. What have you tried so far?

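
One possible take on the "regular string inspection" option is `strcspn`, which counts how many leading bytes are not in a given mask. The sketch below is an assumption about what such a solution could look like, not something posted in the thread; `firstNonAsciiOffset` is a hypothetical name. It looks for the first byte outside the Basic Latin range (>= 0x80), which in valid UTF-8 is always the lead byte of a multi-byte character.

```php
<?php
// Hypothetical helper (not from the thread): offset of the first byte that is
// outside the single-byte (ASCII / Basic Latin) range, or false if none.
function firstNonAsciiOffset(string $string)
{
    static $mask = null;
    if ($mask === null) {
        // Build the mask once: every byte value from 0x80 to 0xFF.
        $mask = implode('', array_map('chr', range(0x80, 0xFF)));
    }
    // Length of the initial run of bytes that are NOT in the mask.
    $offset = strcspn($string, $mask);
    return $offset < strlen($string) ? $offset : false;
}

var_dump(firstNonAsciiOffset("89423io3\xC3\xB6nska032j")); // int(8)
var_dump(firstNonAsciiOffset("plain ascii"));              // bool(false)
```

The scan itself runs inside a single built-in call rather than a PHP-level loop, which may matter when the function is called repeatedly; that is worth measuring rather than assuming.
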
silkfire (author), posted October 22, 2012:
ManiacDan said: "If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate. If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (something like \u123), then put a range [a-z] in your code where A is the first character you want to detect and Z is the maximum possible value. Or just use string functions."
UTF-8 is variable-width, man; take a look here: http://en.wikipedia.org/wiki/UTF-8
But yeah, the basic question is: how do I find the position of the first multi-byte character? I've tried a custom mb_explode function and then looped through the array, but it's not very efficient.

requinix, posted October 22, 2012:
Lemme say it a different way:

    function strposFirstMbChar($string) {
        $len = strlen($string); // length in bytes, not characters
        for ($i = 0; $i < $len; $i++) {
            // Bytes >= 0xC0 (192) start a multi-byte UTF-8 sequence
            if ($string[$i] >= "\xC0") {
                return $i; // byte offset of the lead byte
            }
        }
        return false; // no multi-byte character found
    }
