Get Position Of First Unicode Character


silkfire

I need a way to determine the position of the first unicode character in a string.

 

For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return number 8.

 

None of the mb_ functions I found were suitable for this purpose. Hope someone can help me on this one. I also need the function to be efficient, since it will be called in a loop, so nothing too elaborate.


If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate.

 

If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (in PHP's PCRE that's something like \x{123} with the u modifier), then put a range [A-Z] in your pattern, where A is the first character you want to detect and Z is the maximum possible value.
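To make the regex idea concrete, here is a minimal sketch, assuming "first unicode character" means the first code point at U+0080 or above. The function name firstNonAsciiOffsetRegex is made up for illustration. PREG_OFFSET_CAPTURE reports byte offsets, but since everything before the first match is single-byte ASCII, the byte offset and the character offset are the same number.

```php
<?php
// Hypothetical helper: offset of the first character outside the ASCII range
// (U+0080 and above), or -1 if there is none.
function firstNonAsciiOffsetRegex(string $s): int
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields -1 here.
    if (preg_match('/[\x{80}-\x{10FFFF}]/u', $s, $m, PREG_OFFSET_CAPTURE) === 1) {
        return $m[0][1]; // byte offset of the match
    }
    return -1;
}

// "\u{F6}" is the letter o with diaeresis from the example in the question.
echo firstNonAsciiOffsetRegex("89423io3\u{F6}nska032j"), "\n"; // 8
```

Whether the regex engine is fast enough inside a tight loop is something you would have to measure on your own data.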

 

Or just use string functions.
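A rough sketch of the plain string-function route, assuming the input is valid UTF-8. In UTF-8 every byte of a multi-byte sequence has its high bit set (value 0x80 or above), while ASCII bytes never do, so scanning bytes is enough. The name firstMultibyteOffset is hypothetical.

```php
<?php
// Hypothetical helper: offset of the first byte that belongs to a
// multi-byte UTF-8 sequence, or -1 if the string is pure ASCII.
function firstMultibyteOffset(string $s): int
{
    $len = strlen($s); // length in bytes, not characters
    for ($i = 0; $i < $len; $i++) {
        // ASCII bytes are 0x00-0x7F; anything higher starts a multi-byte sequence
        // (all earlier bytes were ASCII, so this is a lead byte in valid UTF-8).
        if (ord($s[$i]) >= 0x80) {
            return $i;
        }
    }
    return -1;
}

echo firstMultibyteOffset("89423io3\u{F6}nska032j"), "\n"; // 8
```

Because all preceding bytes are single-byte characters, the returned byte offset is also the character offset, so no mb_ call is needed.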


I need a way to determine the position of the first unicode character in a string.

 

Basic Latin characters (for example, A-Z, 0-9) don't count? Your example hints that they don't, but they're unicode characters too.

 

For example, if we have '89423io3önska032j' (encoded in UTF-8), I need to return number 8.

 

Ahh, so it looks like you want the offset of the first character which isn't in the "C0 Controls and Basic Latin" table or perhaps some smaller set: just A-Z, a-z, 0-9?

 

None of the mb_ functions I found were suitable for this purpose.

 

Which ones did you "find"? I am certain this task could be readily resolved with at least one of the available mb_* functions. Or with PCRE regex (preg_* functions), or regular string inspection and manipulation. What have you tried so far?
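If the narrower reading is the right one (anything that isn't A-Z, a-z or 0-9), one possible sketch with ordinary string functions uses strspn(), which returns the length of the leading run made up only of the given characters. The helper name firstDisallowedOffset and the exact allow-list are assumptions for illustration; note that under this reading spaces and punctuation count as "first hit" too.

```php
<?php
// Hypothetical helper: offset of the first byte that is not an ASCII letter
// or digit, or -1 if the whole string consists of allowed characters.
function firstDisallowedOffset(string $s): int
{
    // Assumed allow-list; adjust to whatever set you actually want to skip.
    $allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

    // Length of the initial segment containing only allowed characters.
    $n = strspn($s, $allowed);

    return $n < strlen($s) ? $n : -1;
}

echo firstDisallowedOffset("89423io3\u{F6}nska032j"), "\n"; // 8
```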


If the string is UTF-8, the string is UTF-8. All characters in the string are UTF-8 encoded. Strings aren't like MP3s, they're not variable bitrate.

 

If, like requinix said, you want to find the first character in the string which has a multi-byte encoding, then that's a whole other question and you could conceivably do it in regex with a unicode range. Look up how to do unicode characters in regex (in PHP's PCRE that's something like \x{123} with the u modifier), then put a range [A-Z] in your pattern, where A is the first character you want to detect and Z is the maximum possible value.

 

Or just use string functions.

UTF-8 is variable-width, man. Take a look here: http://en.wikipedia.org/wiki/UTF-8

 

But yeah, the basic question is, how do I find the position of the first multi-byte character?

 

I've tried a custom mb_explode function and then looping through the resulting array, but that isn't very efficient.

