
preg_match and what's going on with encoding?


junk@alf2.com


This has been driving me nuts for the past few hours, so I'm hoping someone can help me understand what's going on here.

 

I have two servers: one online (Linux) and the PC I test on (Windows), both running PHP 5.1.6. The same script produces different output on each, and I can't figure out why.

 

SCRIPT:

 

    $input = "Bóthar greatest";
    printf("original: %s<br>\r\n", $input);
    printf("iconv: %s<br>\r\n", iconv('', 'UTF-8', $input));
    printf("utf8 decoded: %s<br>\r\n", utf8_decode($input));
    printf("iconv trans decoded: %s<br>\r\n", iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $input));
    preg_match('/([\w]{2,}|\d+)/i', utf8_decode($input), $match, PREG_OFFSET_CAPTURE);
    printf("<pre>%s</pre><br>\r\n", print_r($match, true));
    preg_match('/([\w]{2,}|\d+)/i', $input, $match, PREG_OFFSET_CAPTURE);
    printf("<pre>%s</pre><br>\r\n", print_r($match, true));

 

WINDOWS OUTPUT:

 

original: Bóthar greatest<br>
iconv: Bóthar greatest<br>
utf8 decoded: Bóthar greatest<br>
iconv trans decoded: Bóthar greatest<br>
<pre>Array
(
    [0] => Array
        (
            [0] => Bóthar
            [1] => 0
        )

    [1] => Array
        (
            [0] => Bóthar
            [1] => 0
        )

)
</pre><br>
<pre>Array
(
    [0] => Array
        (
            [0] => Bóthar
            [1] => 0
        )

    [1] => Array
        (
            [0] => Bóthar
            [1] => 0
        )

)
</pre><br>

 

LINUX OUTPUT:

 

original: Bóthar greatest<br>
iconv: B<br>
utf8 decoded: Bóthar greatest<br>
iconv trans decoded: Bóthar greatest<br>
<pre>Array
(
    [0] => Array
        (
            [0] => thar
            [1] => 2
        )

    [1] => Array
        (
            [0] => thar
            [1] => 2
        )

)
</pre><br>
<pre>Array
(
    [0] => Array
        (
            [0] => thar
            [1] => 3
        )

    [1] => Array
        (
            [0] => thar
            [1] => 3
        )

)
</pre><br>

 

My problem appears to be that preg_match on Linux does not match anything outside ASCII, whereas preg_match on Windows has no problem matching the "ó". Also, on Linux iconv converts "Bóthar" to "B", whereas on Windows it returns "Bóthar". Why is that?
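One possible explanation, which is an assumption on my part and not something I have confirmed: without the /u modifier, \w works on single bytes and depends on the locale PHP passes to PCRE, so the two machines can disagree about whether the bytes of "ó" count as word characters. Here is a minimal sketch that avoids the locale entirely. It assumes the input is UTF-8 and that PCRE was built with UTF-8 and Unicode property support; the rewritten pattern is my own variation, not the one from the script above.

```php
<?php
// "ó" written as its explicit UTF-8 bytes (0xC3 0xB3), so the result does not
// depend on how this file happens to be saved.
$input = "B\xC3\xB3thar greatest";

// /u makes PCRE treat pattern and subject as UTF-8.
// \p{L} matches any Unicode letter and \p{N} any Unicode number, so the
// match no longer depends on the server's locale tables.
$pattern = '/([\p{L}\p{N}_]{2,}|\p{N}+)/u';

$found = preg_match($pattern, $input, $match, PREG_OFFSET_CAPTURE);

// Show the current character-type locale so the two servers can be compared.
printf("LC_CTYPE: %s\n", setlocale(LC_CTYPE, 0));
printf("matched: %d\n", $found);
print_r($match);
```

Note that the offsets reported by PREG_OFFSET_CAPTURE are byte offsets even with /u, so a match that follows a multi-byte character starts later than its character position would suggest.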


 

In my php.ini the iconv settings are the same on both platforms (everything is ISO-8859-1). The only difference is the iconv implementation: Windows uses "libiconv" (1.9) and Linux uses "glibc" (2.3.4).
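Since the script calls iconv with an empty source charset, my guess (again an assumption) is that each implementation falls back to a different default, and that the Linux one stops at the first byte it cannot interpret. A small sketch that names the source charset explicitly, so neither implementation has to guess; the Latin-1 input string here is a hypothetical example, not data from my application:

```php
<?php
// Hypothetical input: "Bóthar greatest" encoded as ISO-8859-1,
// where "ó" is the single byte 0xF3.
$latin1 = "B\xF3thar greatest";

// Name the source charset instead of passing '' and relying on a default.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Convert back, mirroring the UTF-8 to ISO-8859-1 call in the script above.
$roundTrip = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);

// Hex dumps make the byte-level difference visible regardless of how the
// browser or terminal decodes the output.
printf("latin1:     %s\n", bin2hex($latin1));
printf("utf8:       %s\n", bin2hex($utf8));
printf("round trip: %s\n", bin2hex($roundTrip));
```

Comparing the hex dumps from both servers should show whether the bytes themselves differ or only the way they are displayed.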

 

Does anyone have pointers on where I should be looking to get the two platforms aligned, or on why preg_match can "see" non-ASCII characters on one platform but not on the other?
