yaba

PHP + preg + international chars problem

This one probably goes in the general PHP help, but it involves regex too, so here it is:

What I need is to replace some stuff in a string of international characters, specifically a Greek word.

While I can get everything to work fine on my PC, this is not the case with my host's server. Here's a small example:

[code=php:0]
echo mb_detect_encoding($word)." 1- ".$word."<br/>";
$word = preg_replace("/\W/", "", $word);
echo mb_detect_encoding($word)." 2- ".$word."<br/>";
[/code]

On my machine, this would print (as expected):

[code]
UTF-8 1- ααα.
UTF-8 2- ααα[/code]
where ααα is some Greek word.

On my host's server, the same code prints:

[code]
UTF-8 1- ααα.
ASCII 2- [/code]

which of course is not what I need.

Moreover, I think ereg* functions work on my host too, but they're not as handy, hence I need to use preg* functions.

Can someone please, PLEASE, help me here? I've been pulling my hair out for days!

Thanks for even reading :)

\W can change based on your locale settings. Do you know yours or your host's? My guess is that your machine is Unicode (UTF-8) aware, but your host is not. Your machine sees \W as [^a-zA-Z0-9_[i]and lots of other Unicode characters, including your Greek ones[/i]], which explains why only the period is removed. Your server sees \W as [^a-zA-Z0-9_], which explains why the whole string is emptied. Do you know if the server is Windows? It might be ISO-8859-1.

The server is definitely not Windows. I'm on a VDS. The weird thing is that ereg* functions work with Greek. Well, sort of... didn't investigate too much.

Even if what you are saying is happening, then wouldn't it be normal for the string to be returned intact, instead of empty?

And another thing, how bad must a host be not to support UTF8 in 2006, if what you are saying is true?!

Any suggestions on what I can do?

Thanks again....

Can you expand on the ereg part? What do you mean by sort of? Do you have code that you've tried?

\W always means something. At the bare minimum it means [^a-zA-Z0-9_], which would still match in your string.

The first approach I would take would be to figure out the locale (encoding) that your computer is using, and the one that your host is using. You may also want to provide them with the code you've shown here.
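
For what it's worth, here is a minimal sketch of why the whole string can come back empty rather than intact. It assumes the host really is in the "C" locale and the string is UTF-8. Without the /u modifier the pattern is applied byte by byte, each Greek letter is two bytes, and neither byte is in [a-zA-Z0-9_]. The byte values are GREEK SMALL LETTER ALPHA written out by hand, and the variable names are only illustrative.

```php
<?php
// Assumption: the "C" locale, as the host in this thread appears to use.
setlocale(LC_CTYPE, 'C');

// "ααα." written as raw UTF-8 bytes so the file encoding does not matter.
$word = "\xCE\xB1\xCE\xB1\xCE\xB1.";

// strlen() counts bytes: 3 letters x 2 bytes + 1 period.
$byteCount = strlen($word);

// Without /u, \W is tested against single bytes. No byte of a two-byte
// UTF-8 sequence is an ASCII word character, so every byte matches.
$nonWordBytes = preg_match_all('/\W/', $word, $matches);

// Removing every match therefore removes the entire string.
$stripped = preg_replace('/\W/', '', $word);

echo "bytes: $byteCount, matched by \\W: $nonWordBytes\n";
echo 'after replace: "' . $stripped . '"' . "\n";
```

If this is what is happening on the host, the number of \W matches equals the byte length of the string, so nothing is left after the replace. That would also fit the empty result being reported as ASCII.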

This:

[code=php:0]
setlocale(LC_ALL, "en_US");
echo setlocale(LC_ALL,"")."<br/>";
[/code]

produces this on my machine:
[code]
Greek_Greece.1253
[/code]

and this on the host:
[code]
C
[/code]

Dunno if this helps or means anything, especially since this:

[code=php:0]
//$_GET['term'] is the first letter of the $word.
echo $word."<br/>";
$word = eregi_replace("(.*)(".($_GET['term']).")(.*)", "=>\\2", ($word));
echo $word."<br/>";
[/code]

produces this, on BOTH machines:

[code]
αλάργο
=>α
[/code]

which is exactly what's expected... So ereg works, preg doesn't... too bad, since ereg doesn't have things like \b and \W... :/

I wonder if your host does not support Greek? Here is what I was testing with--I could not get a Greek locale to work:

[code]
<meta charset="utf-8"/>
<pre>
<?php
### SET YOUR LOCALE HERE.

### Create the "GREEK SMALL LETTER ALPHA" character.
$funny_a = pack("c*", 0xCE, 0xB1);
### Create a string of 3 characters + a period.
$string = $funny_a . $funny_a . $funny_a . '.';
### Show.
echo "string before >>>$string<<<";
echo '<br/>';
### Run replace.
$string = preg_replace('/\W/', '', $string);
### Show.
echo "string after >>>$string<<<";

?>
</pre>
[/code]

See if you can set the locale to Greek based on setlocale's documentation. I think it's "ell" or "ell_ell".

The ereg works because you're not doing anything special, like \W. PREG will work with the same pattern if you add delimiters.

If you are unable to set the locale, you may be able to create your own version of \W based on the Unicode character charts.

First of all, thanks for continuing to help. I appreciate it :)

You are right.... This works fine on the server too:

[code=php:0]$word = preg_replace("/(.*)(".($_GET['term']).")(.*)/i", "=>$2", ($word));[/code]

No matter what I pass to setlocale ("el", "el_GR", "UTF-8" etc.), it always returns "C" on the server. In any case, I would think UTF-8 should work for Greek characters. So does that mean even UTF-8 is not supported on my server?

I noticed that according to phpinfo() the 'default_charset' is currently set to 'no value'. Could that be it? Should I change it to 'utf-8' or something similar?

Thanks again!

I see "no value" as well for both Windows and Unix servers. I'm assuming this means use what the OS is using.

I know nothing about locales on Windows, and little about locales on Unix. According to the man pages for [b]locale[/b], [tt]locale -a [/tt]lists all of the available locales. When I do this, I only see a dozen, none of which look like Greek. From searching the web, I found that [url=http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02]the "C" locale you're seeing is also the "POSIX" locale[/url], and based off of other information on this page, it looks like this is basically ASCII.

Do you know what OS your host is using? Can you run any commands to see its available locales?

I'm still trying to see if this can be done without a locale, using a code point range for the Greek chart.

The locale command doesn't exist on my host's shell. However, the common dir for locales contains the following:

[code]
ls /usr/share/locale/
C  POSIX  en_US  en_US.utf8
[/code]

Some of those are empty dirs, but en_US.utf8, which is probably the best shot here, is not. However, when I use setlocale(LC_ALL, "en_US.utf8"); I still get "C"....

I can't believe my host only has ASCII!!!

Thanks once again :D

Although UTF-8 is appearing, I think the key is the [b]en_[/b] prefix; I found something that says el_GR.UTF-8 is the locale for Greek. However, I don't understand how these are "installed"--I don't see this available on my box. [url=http://lists.freebsd.org/pipermail/freebsd-i18n/2004-June/000086.html]This seems related.[/url]

Yes well, any UTF8 would probably be a good start, even the "installed" en_US.utf8. The problem is no matter what and how I call setlocale, when I print the locale set it always returns "C" (the posix thing)....

This should print something if it works:

[code]
<?php
echo setlocale(LC_ALL, 'en_US.UTF-8');
?>
[/code]

If not, I'm assuming the locales you saw are not (properly) installed. Have you tried contacting your host?
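
A slightly longer sketch of the same check, in case it helps narrow things down. setlocale() returns the name of the locale it set, or false if none of the requested names is available. Passing an empty string asks it to take the locale from the environment, so echoing setlocale(LC_ALL, "") right after another call would not show what that call did. Passing "0" only queries the current setting. The helper name pick_locale and the candidate list are made up for illustration; the spellings that actually exist depend on the host.

```php
<?php
// Hypothetical helper: try each candidate and return the first one the
// system accepts, or false if none of them is installed.
function pick_locale(array $candidates) {
    foreach ($candidates as $name) {
        // setlocale() returns the new locale name on success, false on failure.
        $result = setlocale(LC_ALL, $name);
        if ($result !== false) {
            return $result;
        }
    }
    return false;
}

// Assumed spellings; check the host's locale directory for the real ones.
$chosen = pick_locale(array('el_GR.UTF-8', 'el_GR.utf8', 'en_US.UTF-8', 'en_US.utf8', 'C'));
echo 'accepted: ' . var_export($chosen, true) . "\n";

// "0" only reports the current locale; "" would reset it from the environment.
echo 'current: ' . setlocale(LC_ALL, '0') . "\n";
```

If the first line prints 'C' even though en_US.utf8 appears in the locale directory, the locale data is probably incomplete, which is something concrete to show the host.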

Yes, I did contact them and they installed a bunch of others too. Now, after the preg* is executed, the encoding doesn't change, and the string is there intact. But the preg* itself doesn't work, since it returns the whole string intact... :/

Does[tt] setlocale [/tt]work now that new installs have been made?

If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C". Don't know if that has anything to do with it anymore. I also discovered the iconv* functions, played with them, tried every possible combo, still nothing :/

Here's what I get atm:

[code=php:0]
$word="ααα ββ";
echo mb_detect_encoding($word)." 1) $word<br/>";
$word = preg_replace("/(.*)\b(.)(.*)/u", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word<br/>";

$word="aaa bb";
echo mb_detect_encoding($word)." 1) $word<br/>";
$word = preg_replace("/(.*)\b(.)(.*)/", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word<br/>";
[/code]

(The weird chars above are supposed to read $word = "ααα ββ"; )

[code=output]
UTF-8 1) ααα ββ
UTF-8 2) ααα ββ
ASCII 1) aaa bb
ASCII 2) aaa __b__b
[/code]

Don't know what to think anymore. A few days more and a mental institution will be the only way for me :/

[quote author=yaba link=topic=106444.msg427500#msg427500 date=1157339640]
If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C".
[/quote]

You are using a comma after[tt] LC_ALL [/tt]right? I take it the en_US.UTF-8 did not work? My ideas are exhausted until you can get[tt] setlocale [/tt] to work. I think you need to go back and forth with your host, showing them your code.

The only other idea I have is to use Perl, because it supports[tt] \u [/tt]which allows you to specify Unicode code points, thus establishing a range for the Greek block. I haven't had experience with this myself.

Yup, I normally put the "," in, just forgot to add it this time...
UTF-8 didn't work :/

My host won't support basic stuff, showing them my code is extreme!

preg also supports the /u option, but so far it didn't do anything. I didn't do anything with Greek blocks though. How do I do that? What exactly do you mean?

Thanks ;)

Not[tt] /u [/tt] the modifier, nor[tt] \u [/tt] the uppercase escape, but[tt] \u [/tt] the Unicode escape. I found this in Mastering Regular Expressions, but I cannot find it in the Perl docs; I'm a little puzzled. Basically, I was trying to suggest that you match a code point range if your locale will not work. If you look at the [url=http://www.unicode.org/charts/PDF/U0370.pdf]Greek and Coptic chart[/url], it specifies a code point range of 0370-03FF. It would be nice if you could incorporate this into a regular expression with a simple[tt] [^\u0370-\u03FF][/tt]. Without being able to use code points or locales, your only other option (from my beginner's understanding of Unicode) would be to create an array of the UTF-8 byte sequences for all of these characters to use in a regular expression. This is still a nasty solution. I'm going to continue fooling with this as I have time... You may want to consider looking for a host that is knowledgeable and supportive in this area.

[b]Eureka![/b] The[tt] /u [/tt]that you mentioned allows the[tt] \x{...}[/tt] to work. I also found [url=http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php#58409]this[/url] helpful. Run some more tests against this to see that it does what you want:

[code]
<meta charset="utf-8"/>
<pre>
<?php

### To make sure we're not cached while testing.
echo rand(), '<br/><br/>';
### Create the "GREEK SMALL LETTER ALPHA" character.
$alpha = pack("c*", 0xCE, 0xB1);
### Create a string of 3 characters + a period.
$string = $alpha . $alpha . $alpha . '.';
### Show before.
echo "string before >>>$string<<<";
echo '<br/><br/>';
### Run replace, showing what was found along the way.
$string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
### Show.
echo "string after >>>$string<<<";

function test ($matches) {
array_shift($matches);
print_r($matches);
echo '<br/>';
return '';
}
?>
</pre>
[/code]

I get the following output:
[tt]
1557066832

string before >>>ααα.<<<

Array
(
    [0] => .
)

string after >>>ααα<<<
[/tt]
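
A possible alternative to hard-coding the block range, sketched on the assumption that PHP's PCRE was built with Unicode property support. With[tt] /u[/tt], the escapes[tt] \p{L}[/tt] (any letter),[tt] \p{N}[/tt] (any number) and[tt] \p{Greek}[/tt] (the Greek script) can stand in for a locale-aware[tt] \W[/tt]. The function names are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: a UTF-8 aware stand-in for preg_replace('/\W/', '', ...).
function strip_non_word($text) {
    // Remove everything that is not a letter, a number or an underscore.
    $result = preg_replace('/[^\p{L}\p{N}_]/u', '', $text);
    // preg_replace() returns null on error, for example on invalid UTF-8.
    return $result === null ? $text : $result;
}

// Hypothetical helper: keep only characters that belong to the Greek script.
function keep_greek($text) {
    // \P{Greek} (capital P) matches any character NOT in the Greek script.
    $result = preg_replace('/\P{Greek}/u', '', $text);
    return $result === null ? $text : $result;
}

// GREEK SMALL LETTER ALPHA as raw UTF-8 bytes.
$alpha = "\xCE\xB1";
echo strip_non_word($alpha . $alpha . $alpha . '.'), "\n";
echo keep_greek('abc ' . $alpha . $alpha . '!'), "\n";
```

Neither helper depends on setlocale(), which is the point here: the character classes come from PCRE's own Unicode tables rather than from the server's locale.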

Here's a test using the Greek and Armenian blocks. The first loop iteration shows the entire Greek set being run through the regex, where no characters are found/replaced. The second loop iteration shows a combination of the Greek and Armenian blocks being run through the regex, and you'll see that the entire Armenian block is removed.

[code]
<meta charset="utf-8"/>
<pre>
<?php

$greek_block = array();
foreach (range(880, 1023) as $code_point) {
$greek_block[] = code2utf($code_point);
}

$armenian_block = array();
foreach (range(1328, 1423) as $code_point) {
$armenian_block[] = code2utf($code_point);
}

$greek_string = join('', $greek_block);
$greek_armenian_string = join('', $greek_block) . join('', $armenian_block);

foreach (array($greek_string, $greek_armenian_string) as $string) {
### Show before.
echo "string before >>>$string<<<";
echo '<br/><br/>';
### Run replace, showing what was found along the way.
$string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
### Show.
echo "string after >>>$string<<<";
echo '<hr/>';

}

function test ($matches) {
array_shift($matches);
print_r($matches);
echo '<br/>';
return '';
}

### Borrowed from http://us3.php.net/manual/en/function.utf8-encode.php#58461
function code2utf($num) {
  if($num<128)return chr($num);
  if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
  return '';
}
?>
</pre>
[/code]

This certainly looks promising! I'll have a good look and report back...

Thanks for all your effort, effigy :)

Hello again!

How would I translate something like this:

[code=php:0]preg_replace("/(.*)\b(".($a).")(.*)/ui", "$1___$2____$3", ($b));[/code]

to use \x etc? I mean, when the pattern contains some chars that are obtained dynamically, what do I do?

To make it even more difficult (!), what do I need to put to match both Greek and Latin (English) characters? So it would match for $a and $b being either both Greek, or both English?

:-\

[quote]How would I translate something like this to use \x etc?[/quote]
Depends on what you're trying to do... can you expand? Also, without the proper locale set, I wouldn't use[tt] \b[/tt].

[quote]what do I need to put to match both greek and latin (english) characters?[/quote]
Simply add another code point range. The [url=http://www.unicode.org/charts/PDF/U0000.pdf]Latin chart[/url] goes from 0000 to 007F; therefore, to match Greek and Latin, use [tt]/([\x{0370}-\x{03FF}\x{0000}-\x{007F}])/u[/tt].
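
A quick way to try the combined class, as a sketch only. The function name is made up, and it uses the two ranges above (Basic Latin and Greek and Coptic) to check that a whole string stays inside them. The Armenian letter in the second call is there just to have something outside both ranges.

```php
<?php
// Hypothetical helper: true if every character of $text lies in the
// Basic Latin (0000-007F) or Greek and Coptic (0370-03FF) block.
function is_greek_or_basic_latin($text) {
    // \A and \z anchor to the very start and end of the string;
    // /u makes PCRE read the subject as UTF-8 code points, not bytes.
    return preg_match('/\A[\x{0000}-\x{007F}\x{0370}-\x{03FF}]*\z/u', $text) === 1;
}

// "abc " followed by GREEK SMALL LETTER ALPHA (UTF-8 bytes CE B1).
var_dump(is_greek_or_basic_latin("abc \xCE\xB1"));
// "abc " followed by ARMENIAN CAPITAL LETTER AYB (UTF-8 bytes D4 B1).
var_dump(is_greek_or_basic_latin("abc \xD4\xB1"));
```

Note that the Basic Latin range also covers digits, spaces and punctuation, so this is broader than "letters only".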

OK here's what I need to do:

Given $_GET['searchLetter'], I perform a full-text search in my DB for all words/phrases that contain at least one word that starts with that letter. For example, if $_GET['searchLetter'] = 'a', then the search would return 'A dog', 'some phrase with ALetter', and so on...

I then want to apply some css to highlight that letter (well, actually it can be a word or part of a word). With normal preg, I'd do it this way:

[code=php:0]$word = preg_replace("/(.*)\b(".($_GET['searchLetter']).")(.*)/i", "$1<span class=\"highlightmatch\">$2</span>$3", $word);[/code]

which works fine for English chars (and Greek chars, on my local PC).

Thanks once again for all the help :)

Try this:

[code]
<meta charset="utf-8"/>
<pre>
<?php

### Create the "GREEK SMALL LETTER ALPHA" character.
$alpha = pack("c*", 0xCE, 0xB1);
### Create a string using the alpha.
echo $string = "${alpha} string with ${alpha}n ${alpha}lph${alpha} ch${alpha}r${alpha}cter: ${alpha}bc, ${alpha}${alpha}${alpha}.";
### Show before.
echo '<br/><br/>';
### Run replace and highlight.
echo $string = preg_replace('/(?<=\p{Z})(' . $alpha . ')(?=.)/u', '<b><u>\1</u></b>', $string);
?>
</pre>
[/code]

It's my understanding that the[tt] /u [/tt]modifies the[tt] . [/tt]as well. You'll want to run this through more tests.
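
One possible way to wrap this up for a search term that arrives at run time, as a sketch under a few assumptions. The function name highlight_word_start is made up, the term is escaped with preg_quote() so it cannot alter the pattern, and a negative lookbehind on letters, numbers and underscore takes the place of[tt] \b [/tt]so that a match at the very start of the string is also found. The CSS class is the one from the earlier post.

```php
<?php
// Hypothetical helper: wrap $term in a span wherever it begins a word.
function highlight_word_start($text, $term) {
    if ($term === '') {
        return $text; // nothing to highlight
    }
    // Escape regex metacharacters and the "/" delimiter in the user input.
    $quoted = preg_quote($term, '/');
    // (?<![\p{L}\p{N}_]) means: not preceded by a letter, number or underscore.
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '/(?<![\p{L}\p{N}_])(' . $quoted . ')/iu';
    $result = preg_replace($pattern, '<span class="highlightmatch">$1</span>', $text);
    // preg_replace() returns null on error, for example on invalid UTF-8.
    return $result === null ? $text : $result;
}

// GREEK SMALL LETTER ALPHA as raw UTF-8 bytes.
$alpha = "\xCE\xB1";
echo highlight_word_start($alpha . 'bc, x' . $alpha . ' ' . $alpha . $alpha, $alpha), "\n";
echo highlight_word_start('A dog', 'a'), "\n";
```

This does not HTML-escape the text or the term, so in a real page both would need htmlspecialchars() or similar before being echoed.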
