Try this (*untested*):

 

    /**
     * Count the number of bytes in a given string.
     * The input string is expected to be ASCII or UTF-8 encoded.
     * Warning: the function doesn't return the number of chars
     * in the string, but the number of bytes.
     *
     * @param string $str The string whose bytes are counted
     *
     * @return int The length in bytes of the given string.
     */
    function strBytes($str)
    {
      // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT

      // Byte offset into the string; doubles as the byte counter
      $d = 0;

     /*
      * Walk the string one character at a time: read the lead byte
      * at offset $d and skip ahead by that character's encoded length.
      * isset() on a string offset is false once $d passes the last byte,
      * so the loop ends whether or not strlen() counts bytes.
      */
      while (isset($str[$d])) {

          $ord_var_c = ord($str[$d]);
         
          switch (true) {
              case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)):
                  // characters U-00000000 - U-0000007F (same as ASCII)
                  $d++;
                  break;
             
              case (($ord_var_c & 0xE0) == 0xC0):
                  // characters U-00000080 - U-000007FF, mask 110XXXXX
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                  $d+=2;
                  break;

              case (($ord_var_c & 0xF0) == 0xE0):
                  // characters U-00000800 - U-0000FFFF, mask 1110XXXX
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                  $d+=3;
                  break;

              case (($ord_var_c & 0xF8) == 0xF0):
                  // characters U-00010000 - U-001FFFFF, mask 11110XXX
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                  $d+=4;
                  break;

              case (($ord_var_c & 0xFC) == 0xF8):
                  // characters U-00200000 - U-03FFFFFF, mask 111110XX
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                  $d+=5;
                  break;

              case (($ord_var_c & 0xFE) == 0xFC):
                  // characters U-04000000 - U-7FFFFFFF, mask 1111110X
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                  $d+=6;
                  break;
              default:
                  // any other byte (control characters, stray continuation bytes)
                  $d++;
          }
      }
     
      return $d;
    }
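
For what it's worth, the masks in the switch above follow the lead-byte patterns from the UTF-8 table linked in the comments. Here is a minimal sketch of just that lookup, assuming valid UTF-8 input. `utf8SeqLen` and `countBytes` are made-up names for illustration, not part of the function above, and the sketch stops at four-byte sequences because RFC 3629 limits UTF-8 to U+10FFFF. On a stock PHP setup (no mbstring function overloading) the built-in strlen() already counts bytes, so it is used here as a cross-check:

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with lead byte $byte (an integer 0-255).
function utf8SeqLen(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0XXXXXXX, plain ASCII
    if (($byte & 0xE0) === 0xC0) return 2; // 110XXXXX
    if (($byte & 0xF0) === 0xE0) return 3; // 1110XXXX
    if (($byte & 0xF8) === 0xF0) return 4; // 11110XXX
    return 1; // stray continuation or invalid byte: count it as one byte
}

// Hypothetical wrapper: walk the string from lead byte to lead byte.
function countBytes(string $str): int
{
    $d = 0; // byte offset, doubles as the byte counter
    while (isset($str[$d])) {
        $d += utf8SeqLen(ord($str[$d]));
    }
    return $d;
}

// "é" takes 2 bytes and "€" takes 3 bytes in UTF-8.
$sample = "h\u{00E9}llo \u{20AC}";
var_dump(countBytes($sample) === strlen($sample));
```

The `\u{...}` escapes and scalar type hints need PHP 7 or later. If strlen() is overloaded by mbstring on an older install, the two counts can differ, which is the case the longer function is meant for.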

Woah! Thanks, MadTechie. I didn't know it would be so much work. I was hoping there would be a built-in function; otherwise I would've done more research myself instead of having you type all that out. I'll give that a look, though. Thanks.
