PHP + preg + international chars problem

yaba · September 1, 2006

This one probably belongs in the general PHP help, but it involves regex too, so here it is:

What I need is to replace some stuff in an international-character string, specifically a Greek word.

While I can get everything to work fine on my PC, this is not the case with my host's server. Here's a small example:

[code=php:0]
echo mb_detect_encoding($word)." 1- ".$word." ";
$word = preg_replace("/\W/", "", $word);
echo mb_detect_encoding($word)." 2- ".$word." ";
[/code]

On my machine, this would print (as expected):

[code]
UTF-8 1- ααα.
UTF-8 2- ααα[/code]
where ααα is some Greek word.

On my host's server, the same code prints:

[code]
UTF-8 1- ααα.
ASCII 2- [/code]

which of course is not what I need.

Moreover, I think ereg* functions work on my host too, but they're not as handy, hence I need to use preg* functions.

Can someone please, PLEASE help me here? I've been pulling my hair out for days!

Thanks for even reading :)

effigy · September 1, 2006

\W can change based on your locale settings. Do you know yours or your host's? My guess is that your machine is Unicode (UTF-8) aware, but your host is not. Your machine sees \W as [^a-zA-Z0-9_[i]and lots of other Unicode characters, including your Greek ones[/i]], which explains why only the period is removed. Your server sees \W as [^a-zA-Z0-9_], which explains why the whole string is emptied. Do you know if the server is Windows? It might be ISO-8859-1.

yaba · September 1, 2006

The server is definitely not Windows. I'm on a VDS. The weird thing is that ereg* functions work with Greek. Well, sort of... I didn't investigate too much.

Even if what you are saying is happening, then wouldn't it be normal for the string to be returned intact, instead of empty?

And another thing, how bad must a host be not to support UTF8 in 2006, if what you are saying is true?!

Any suggestions on what I can do?

Thanks again....

effigy · September 1, 2006

Can you expand on the ereg part? What do you mean by sort of? Do you have code that you've tried?

\W always means something. At the bare minimum it means [^a-zA-Z0-9_], which would still match in your string.
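
To see why the whole string disappears rather than staying intact, here is a minimal sketch. It assumes the "C" locale (what an ASCII-only host would report) and writes the Greek word as raw UTF-8 bytes; it is an illustration, not output from your server:

[code]
<?php
// Assumption: force the "C" locale so PCRE falls back to its ASCII tables.
setlocale(LC_ALL, 'C');

// "ααα." as raw UTF-8 bytes: each alpha is the two bytes 0xCE 0xB1.
$word = "\xCE\xB1\xCE\xB1\xCE\xB1.";

// Without the /u modifier the pattern is applied byte by byte, so every
// byte >= 0x80 counts as a non-word character, just like the period.
$hits = preg_match_all('/\W/', $word, $found);
$stripped = preg_replace('/\W/', '', $word);

echo strlen($word), " bytes, ", $hits, " matches for \\W\n";
var_dump($stripped);
[/code]

Under those assumptions every one of the 7 bytes should match \W, which leaves an empty string; that would also explain why mb_detect_encoding reports ASCII afterwards.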

The first approach I would take would be to figure out the locale (encoding) that your computer is using, and the one that your host is using. You may also want to provide them with the code you've shown here.

yaba · September 1, 2006

This:

[code=php:0]
setlocale(LC_ALL, "en_US");
echo setlocale(LC_ALL,"")." ";
[/code]

produces this on my machine:
[code]
Greek_Greece.1253
[/code]

and this on the host:
[code]
C
[/code]

Dunno if this helps or means anything, especially since this:

[code=php:0]
//$_GET['term'] is the first letter of the $word.
echo $word." ";
$word = eregi_replace("(.*)(".($_GET['term']).")(.*)", "=>\\2", ($word));
echo $word." ";
[/code]

produces this, on BOTH machines:

[code]
αλάργο
=>α
[/code]

which is exactly what's expected... So ereg works, preg doesn't... too bad, since ereg doesn't have things like \b and \W... :/

effigy · September 1, 2006

I wonder if your host does not support Greek? Here is what I was testing with--I could not get a Greek locale to work:

[code]
<meta charset="utf-8"/>
<pre>
<?php
### SET YOUR LOCALE HERE.

### Create the "GREEK SMALL LETTER ALPHA" character.
$funny_a = pack("c*", 0xCE, 0xB1);
### Create a string of 3 characters + a period.
$string = $funny_a . $funny_a . $funny_a . '.';
### Show.
echo "string before >>>$string<<<";
echo ' ';
### Run replace.
$string = preg_replace('/\W/', '', $string);
### Show.
echo "string after >>>$string<<<";

?>
</pre>
[/code]

See if you can set the locale to Greek based on setlocale's documentation. I think it's "ell" or "ell_ell".

The ereg works because you're not doing anything special, like \W. PREG will work with the same pattern if you add delimiters.

If you are unable to set the locale, you may be able to create your own version of \W based on the Unicode character charts.

yaba · September 1, 2006

First of all, thanks for continuing to help. I appreciate it :)

You are right.... This works fine on the server too:

[code=php:0]$word = preg_replace("/(.*)(".($_GET['term']).")(.*)/i", "=>$2", ($word));[/code]

No matter what I pass to setlocale ("el", "el_GR", "UTF-8", etc.), it always returns "C" on the server. In any case, I would think UTF-8 should work for Greek characters. So does that mean even UTF-8 is not supported on my server?

I noticed that, according to phpinfo(), the 'default_charset' is currently set to 'no value'. Could that be it? Should I change it to 'utf-8' or something similar?

Thanks again!

effigy · September 1, 2006

I see "no value" as well for both Windows and Unix servers. I'm assuming this means use what the OS is using.

I know nothing about locales on Windows, and little about locales on Unix. According to the man pages for [b]locale[/b], [tt]locale -a [/tt]lists all of the available locales. When I do this, I only see a dozen, none of which look like Greek. From searching the web, I found that [url=http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02]the "C" locale you're seeing is also the "POSIX" locale[/url], and based off of other information on this page, it looks like this is basically ASCII.

Do you know what OS your host is using? Can you run any commands to see its available locales?

I'm still trying to see if this can be done without a locale, using a code point range for the Greek chart.

yaba · September 1, 2006

The locale command doesn't exist on my host's shell. However, the common dir for locales contains the following:

[code]
ls /usr/share/locale/
C POSIX en_US en_US.utf8
[/code]

Some of those are empty dirs, but en_US.utf8, which is probably the best shot here, is not. However, when I use setlocale(LC_ALL, "en_US.utf8"); I still get "C"....

I can't believe my host only got ASCII!!!

thanks once again :D

effigy · September 1, 2006

Although UTF-8 is appearing, I think the key is the [b]en_[/b] prefix; I found something that says el_GR.UTF-8 is the locale for Greek. However, I don't understand how these are "installed"--I don't see this available on my box. [url=http://lists.freebsd.org/pipermail/freebsd-i18n/2004-June/000086.html]This seems related.[/url]

yaba · September 1, 2006

Yes well, any UTF8 would probably be a good start, even the "installed" en_US.utf8. The problem is no matter what and how I call setlocale, when I print the locale set it always returns "C" (the posix thing)....

effigy · September 3, 2006

This should print something if it works:

[code]
<?php
echo setlocale(LC_ALL, 'en_US.UTF-8');
?>
[/code]

If not, I'm assuming the locales you saw are not (properly) installed. Have you tried contacting your host?
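
A slightly more explicit check, as a sketch only: [tt]setlocale[/tt] returns false when none of the requested locales is available, so [tt]echo[/tt] prints nothing in that case. The helper name [tt]pick_locale[/tt] and the list of candidate names are my own assumptions; adjust them to whatever your host actually installs.

[code]
<?php
// Hypothetical helper: try several locale names and report what happened.
function pick_locale(array $candidates)
{
    // setlocale() tries each name in order and returns the one it applied,
    // or false if none of them is available on this system.
    $applied = setlocale(LC_ALL, $candidates);

    // Passing "0" only queries the current setting; it changes nothing.
    $current = setlocale(LC_ALL, '0');

    return array('applied' => $applied, 'current' => $current);
}

// Assumed candidate names; the exact spelling varies between systems.
$result = pick_locale(array('el_GR.UTF-8', 'el_GR.utf8', 'en_US.UTF-8', 'en_US.utf8'));

var_dump($result['applied']); // false means nothing in the list could be set
var_dump($result['current']);
[/code]

Using [tt]var_dump[/tt] instead of [tt]echo[/tt] makes a failure visible as bool(false) rather than as an empty line.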

yaba · September 3, 2006

Yes, I did contact them and they installed a bunch of others too. Now, after the preg* is executed, the encoding doesn't change, and the string is there intact. But the preg* itself doesn't work, since it returns the whole string intact... :/

effigy · September 3, 2006

Does[tt] setlocale [/tt]work now that new installs have been made?

yaba · September 4, 2006

If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C". Don't know if that has anything to do anymore. I also discovered the iconv* functions, played with them, tried every possible combo, still nothing :/

Here's what I get atm:

[code=php:0]
$word="ααα ββ";
echo mb_detect_encoding($word)." 1) $word ";
$word = preg_replace("/(.*)\b(.)(.*)/u", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word ";

$word="aaa bb";
echo mb_detect_encoding($word)." 1) $word ";
$word = preg_replace("/(.*)\b(.)(.*)/", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word ";
[/code]

(The weird chars above are supposed to read $word = "ααα ββ"; )

[code=output]
UTF-8 1) ααα ββ
UTF-8 2) ααα ββ
ASCII 1) aaa bb
ASCII 2) aaa __b__b
[/code]

Don't know what to think anymore. A few days more and a mental institution will be the only way for me :/

effigy · September 4, 2006

[quote author=yaba link=topic=106444.msg427500#msg427500 date=1157339640]
If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C".
[/quote]

You are using a comma after[tt] LC_ALL [/tt]right? I take it the en_US.UTF-8 did not work? My ideas are exhausted until you can get[tt] setlocale [/tt] to work. I think you need to go back and forth with your host, showing them your code.

The only other idea I have is to use Perl, because it supports[tt] \u [/tt]which allows you to specify Unicode code points, thus establishing a range for the Greek block. I haven't had experience with this myself.

yaba · September 7, 2006

Yup, I normally put the "," in, just forgot to add it this time...
UTF-8 didn't work :/

My host won't support basic stuff, showing them my code is extreme!

preg also supports the /u option, but so far it didn't do anything. I didn't do anything with greek blocks though. How do I do that? What exactly do you mean?

Thanks ;)

effigy · September 7, 2006

Not[tt] /u [/tt] the modifier, nor[tt] \u [/tt] the uppercase escape, but[tt] \u [/tt] the Unicode escape. I found this in Mastering Regular Expressions, but I cannot find it in the Perl docs; I'm a little puzzled. Basically, I was trying to suggest that you match a code point range if your locale will not work. If you look at the [url=http://www.unicode.org/charts/PDF/U0370.pdf]Greek and Coptic chart[/url], it specifies a code point range of 0370-03FF. It would be nice if you could incorporate this into a regular expression with a simple[tt] [^\u0370-\u03FF][/tt]. Without being able to use code points or locales, your only other option (from my beginner's understanding of Unicode) would be to create an array of the UTF-8 byte sequences for all of these characters to use in a regular expression. This is still a nasty solution. I'm going to continue fooling with this as I have time... You may want to consider looking for a host that is knowledgeable and supportive in this area.

effigy · September 7, 2006

[b]Eureka![/b] The[tt] /u [/tt]that you mentioned allows the[tt] \x{...}[/tt] to work. I also found [url=http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php#58409]this[/url] helpful. Run some more tests against this to see that it does what you want:

[code]
<meta charset="utf-8"/>
<pre>
<?php

### To make sure we're not cached while testing.
echo rand(), ' ';
### Create the "GREEK SMALL LETTER ALPHA" character.
$alpha = pack("c*", 0xCE, 0xB1);
### Create a string of 3 characters + a period.
$string = $alpha . $alpha . $alpha . '.';
### Show before.
echo "string before >>>$string<<<";
echo ' ';
### Run replace, showing what was found along the way.
$string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
### Show.
echo "string after >>>$string<<<";

function test ($matches) {
    array_shift($matches);
    print_r($matches);
    echo ' ';
    return '';
}
?>
</pre>
[/code]

I get the following output:
[tt]
1557066832

string before >>>ααα.<<<

Array
(
[0] => .
)

string after >>>ααα<<<
[/tt]

effigy · September 7, 2006

Here's a test using the Greek and Armenian blocks. The first loop iteration shows the entire Greek set being run through the regex, where no characters are found/replaced. The second loop iteration shows a combination of the Greek and Armenian blocks being run through the regex, and you'll see that the entire Armenian block is removed.

[code]
<meta charset="utf-8"/>
<pre>
<?php

$greek_block = array();
foreach (range(880, 1023) as $code_point) {
    $greek_block[] = code2utf($code_point);
}

$armenian_block = array();
foreach (range(1328, 1423) as $code_point) {
    $armenian_block[] = code2utf($code_point);
}

$greek_string = join('', $greek_block);
$greek_armenian_string = join('', $greek_block) . join('', $armenian_block);

foreach (array($greek_string, $greek_armenian_string) as $string) {
    ### Show before.
    echo "string before >>>$string<<<";
    echo ' ';
    ### Run replace, showing what was found along the way.
    $string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
    ### Show.
    echo "string after >>>$string<<<";
    echo '<hr/>';

}

function test ($matches) {
    array_shift($matches);
    print_r($matches);
    echo ' ';
    return '';
}

### Borrowed from http://us3.php.net/manual/en/function.utf8-encode.php#58461
function code2utf($num) {
    if($num<128)return chr($num);
    if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
    if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
    return '';
}
?>
</pre>
[/code]
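
A possible alternative to hard-coding the block range, offered as a sketch: with[tt] /u[/tt], PCRE also understands Unicode script properties such as[tt] \p{Greek}[/tt], provided it was built with Unicode property support (an assumption about your host). The sample text is made up, and the file is assumed to be saved as UTF-8.

[code]
<?php
// Assumption: this source file is saved as UTF-8.
$text = 'αλάργο, abc.';

// Remove every character that is not in the Greek script.
// \p{Greek} needs the /u modifier and Unicode property support in PCRE.
$greek_only = preg_replace('/[^\p{Greek}]/u', '', $text);

echo $greek_only, "\n";
[/code]

Note that the script property and the 0370-03FF block are not the same set: as far as I know the script also covers the Greek Extended (polytonic) letters, while the block contains a few Coptic characters. Test against your own data before relying on either.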

yaba · September 8, 2006

This certainly looks promising! I'll have a good look and report back...

Thanks for all your effort, effigy :)

yaba · September 9, 2006

Hello again!

How would I translate something like this:

[code=php:0]preg_replace("/(.*)\b(".($a).")(.*)/ui", "$1___$2____$3", ($b));[/code]

to use \x etc? I mean, when the pattern contains some chars that are obtained dynamically, what do I do?

To make it even more difficult (!), what do I need to put to match both Greek and Latin (English) characters? So it would match for $a and $b being either both Greek, or both English?

:-\

effigy · September 10, 2006

[quote]How would I translate something like this to use \x etc?[/quote]
Depends on what you're trying to do... can you expand? Also, without the proper locale set, I wouldn't use[tt] \b[/tt].

[quote]what do I need to put to match both greek and latin (english) characters?[/quote]
Simply add another code point range. The [url=http://www.unicode.org/charts/PDF/U0000.pdf]Basic Latin chart[/url] goes from 0000 to 007F; therefore, to match Greek and Latin, use [tt]/([\x{0370}-\x{03FF}\x{0000}-\x{007F}])/u[/tt].
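
One caveat, as a sketch you should verify yourself: 0000-007F is all of ASCII, so it also covers digits, spaces, and punctuation such as the period. If you only want letters, a narrower class may be closer to what you need. The sample text below is made up, and the file is assumed to be saved as UTF-8.

[code]
<?php
// Assumption: this source file is saved as UTF-8.
$text = 'ααα abc 123.';

// Greek block plus all of 0000-007F: nothing in the sample falls outside.
$wide = preg_replace('/[^\x{0370}-\x{03FF}\x{0000}-\x{007F}]/u', '', $text);

// Greek block plus ASCII letters only: spaces, digits and the period go.
$letters = preg_replace('/[^\x{0370}-\x{03FF}A-Za-z]/u', '', $text);

var_dump($wide);
var_dump($letters);
[/code]

Here the first call should return the text unchanged, while the second keeps only the Greek and English letters.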

yaba · September 10, 2006

OK here's what I need to do:

given $_GET['searchLetter'], I perform a FT search in my DB for all words/phrases that contain at least a word that starts with that letter. For example if $_GET['searchLetter'] = 'a', then search would return 'A dog', 'some phrase with ALetter', and so on...

I then want to apply some css to highlight that letter (well, actually it can be a word or part of a word). With normal preg, I'd do it this way:

[code=php:0]$word = preg_replace("/(.*)\b(".($_GET['searchLetter']).")(.*)/i", "$1$2$3", $word);[/code]

which works fine for English chars (and Greek chars, on my local PC).

Thanks once again for all the help :)

effigy · September 11, 2006

Try this:

[code]
<meta charset="utf-8"/>
<pre>
<?php

### Create the "GREEK SMALL LETTER ALPHA" character.
$alpha = pack("c*", 0xCE, 0xB1);
### Create a string using the alpha and show it before the replace.
echo $string = "{$alpha} string with {$alpha}n {$alpha}lph{$alpha} ch{$alpha}r{$alpha}cter: {$alpha}bc, {$alpha}{$alpha}{$alpha}.";
echo ' ';
### Run replace and highlight.
echo $string = preg_replace('/(?<=\p{Z})(' . $alpha . ')(?=.)/u', '\1', $string);
?>
</pre>
[/code]

It's my understanding that the[tt] /u [/tt]modifies the[tt] . [/tt]as well. You'll want to run this through more tests.
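
For a search term that arrives dynamically, something along these lines might work; treat it as a sketch under assumptions, not a tested solution for your site. The function name [tt]highlight_word_start[/tt] and the CSS class [tt]hl[/tt] are made up. It escapes the term with [tt]preg_quote[/tt] so regex metacharacters in user input are taken literally, and it uses a negative lookbehind on letters, digits, and underscore in place of[tt] \b[/tt], so a match at the very start of the string is found too (a lookbehind that requires a separator character cannot match there).

[code]
<?php
// Hypothetical helper: wrap $term in a span wherever it starts a word.
function highlight_word_start($text, $term)
{
    // Escape regex metacharacters in the user-supplied term.
    $quoted = preg_quote($term, '/');

    // (?<![\p{L}\p{N}_]) means "not preceded by a letter, digit or
    // underscore", which also holds at the start of the string.
    // /u treats pattern and subject as UTF-8; /i ignores case.
    $pattern = '/(?<![\p{L}\p{N}_])(' . $quoted . ')/iu';

    // The class name "hl" is an assumption; use your own CSS class.
    return preg_replace($pattern, '<span class="hl">$1</span>', $text);
}

// Assumption: this source file is saved as UTF-8.
echo highlight_word_start('A dog, some ALetter', 'a'), "\n";
echo highlight_word_start('αλάργο και άλλα', 'α'), "\n";
[/code]

Accented letters are distinct characters, so a search for α will not highlight a word starting with ά; whether that is what you want depends on your data.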
