PHP + preg + international chars problem


29 replies to this topic

#1 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 04:56 AM

This one probably goes in the general php help, but it involves regex too, so here it is:

What I need is to replace some stuff in an international-character string, specifically a greek word.

While I can get everything to work fine on my PC, this is not the case with my host's server. Here's a small example:

echo mb_detect_encoding($word)." 1- ".$word."<br/>";
$word = preg_replace("/\W/", "", $word);
echo mb_detect_encoding($word)." 2- ".$word."<br/>";

On my machine, this would print (as expected):

UTF-8 1- ααα.
UTF-8 2- ααα
where ααα is some Greek-character word.

On my host's server, the same code prints:

UTF-8 1- ααα.
ASCII 2-

which of course is not what I need.

Moreover, I think the ereg* functions work on my host too, but they're not as handy, hence I need to use the preg* functions.

Can someone please, PLEASE help me here? I've been pulling my hair out for days!

Thanks for even reading :)

#2 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 01 September 2006 - 02:02 PM

\W can change based on your locale settings. Do you know yours or your host's? My guess is that your machine is Unicode (UTF-8) aware, but your host is not. Your machine sees \W as [^a-zA-Z0-9_and lots of other Unicode characters, including your Greek ones], which explains why only the period is removed. Your server sees \W as [^a-zA-Z0-9_], which explains why the whole string is emptied. Do you know if the server is Windows? It might be ISO-8859-1.
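A minimal sketch of the behaviour described above, assuming PHP 7 or later with its bundled PCRE and the default "C" locale. The helper names strip_nonword_bytes and strip_nonword_unicode are made up for illustration. Without the /u modifier PCRE works byte by byte, so every byte of a UTF-8 Greek letter counts as \W and is removed; with /u and explicit Unicode properties the letters survive.

```php
<?php
// Force the "C" locale so PCRE uses its built-in ASCII character tables.
setlocale(LC_ALL, 'C');

// Byte mode: every byte above 0x7F is treated as a non-word character.
function strip_nonword_bytes(string $s): ?string {
    return preg_replace('/\W/', '', $s);
}

// UTF-8 mode with explicit Unicode classes instead of \W.
function strip_nonword_unicode(string $s): ?string {
    return preg_replace('/[^\p{L}\p{N}_]/u', '', $s);
}

$word = "\xCE\xB1\xCE\xB1\xCE\xB1."; // "ααα." written as UTF-8 bytes

var_dump(strip_nonword_bytes($word));   // string(0) ""
var_dump(strip_nonword_unicode($word)); // the three alphas, period removed
```

The empty result in byte mode matches the "ASCII 2-" line from the first post: an empty string contains nothing but ASCII.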
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#3 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 02:53 PM

The server is definitely not Windows. I'm on a VDS. The weird thing is that the ereg* functions work with Greek. Well, sort of... didn't investigate too much.

Even if what you are saying is happening, then wouldn't it be normal for the string to be returned intact, instead of empty?

And another thing, how bad must a host be not to support UTF8 in 2006, if what you are saying is true?!

Any suggestions on what I can do?

Thanks again....

#4 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 01 September 2006 - 03:35 PM

Can you expand on the ereg part? What do you mean by sort of? Do you have code that you've tried?

\W always means something. At the bare minimum it means [^a-zA-Z0-9_], which would still match in your string.

The first approach I would take would be to figure out the locale (encoding) that your computer is using, and the one that your host is using. You may also want to provide them with the code you've shown here.
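One way to check the locale on each machine, as a rough sketch. It relies on two documented behaviours of setlocale(): passing "0" returns the current setting without changing it, and passing an array tries each name in turn and returns false if none is installed. The candidate names are only guesses at what a host might have installed.

```php
<?php
// Passing "0" asks for the current setting without changing anything.
$before = setlocale(LC_CTYPE, '0');
echo "current LC_CTYPE: $before\n";

// Try several spellings; setlocale() uses the first one the system accepts
// and returns false if none of them is available.
$candidates = ['el_GR.UTF-8', 'el_GR.utf8', 'en_US.UTF-8', 'en_US.utf8'];
$chosen = setlocale(LC_CTYPE, $candidates);

echo $chosen === false
    ? "none of the candidate locales is available\n"
    : "switched LC_CTYPE to: $chosen\n";
```

Note that setlocale(LC_ALL, "") does not query the locale: it sets it from the environment and returns the result.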

#5 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 04:05 PM

This:

setlocale(LC_ALL, "en_US");
echo setlocale(LC_ALL,"")."<br/>";

produces this on my machine:
Greek_Greece.1253

and this on the host:
C

Dunno if this helps or means anything, specially since this:

//$_GET['term'] is the first letter of the $word.
echo $word."<br/>";
$word = eregi_replace("(.*)(".($_GET['term']).")(.*)", "=>\\2", ($word));
echo $word."<br/>";

produces this, on BOTH machines:

αλάργο
=>α

which is exactly what's expected... So ereg works, preg doesn't... too bad, since ereg doesn't have things like \b and \W... :/

#6 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 01 September 2006 - 05:06 PM

I wonder if your host does not support Greek? Here is what I was testing with--I could not get a Greek locale to work:

<meta charset="utf-8"/>
<pre>
<?php
	### SET YOUR LOCALE HERE.
	
	### Create the "GREEK SMALL LETTER ALPHA" character.
	$funny_a = pack("C*", 0xCE, 0xB1);
	### Create a string of 3 characters + a period.
	$string = $funny_a . $funny_a . $funny_a . '.';
	### Show.
	echo "string before >>>$string<<<";
	echo '<br/>';
	### Run replace.
	$string = preg_replace('/\W/', '', $string);
	### Show.
	echo "string after >>>$string<<<";

?>
</pre>

See if you can set the locale to Greek based on setlocale's documentation. I think it's "ell" or "ell_ell".

The ereg works because you're not doing anything special, like \W. PREG will work with the same pattern if you add delimiters.
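A sketch of what that conversion could look like, assuming PHP 7 or later. The function name arrow_term is hypothetical, and the $term parameter stands in for $_GET['term']. Running the user-supplied term through preg_quote() is an addition not in the original code: it keeps characters such as "." or "(" from being read as regex syntax.

```php
<?php
// preg_* equivalent of eregi_replace("(.*)(term)(.*)", "=>\\2", $word).
function arrow_term(string $word, string $term): ?string {
    // Escape regex metacharacters and the "/" delimiter in the user input.
    $quoted = preg_quote($term, '/');
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    return preg_replace('/(.*)(' . $quoted . ')(.*)/iu', '=>$2', $word);
}

$word = "\xCE\xB1\xCE\xBB\xCE\xAC\xCF\x81\xCE\xB3\xCE\xBF"; // "αλάργο"
echo arrow_term($word, "\xCE\xB1"), "\n";                    // "=>α"
```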

If you are unable to set the locale, you may be able to create your own version of \W based on the Unicode character charts.


#7 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 06:08 PM

First of all, thanks for continuing to help. I appreciate it :)

You are right.... This works fine on the server too:

$word = preg_replace("/(.*)(".($_GET['term']).")(.*)/i", "=>$2", ($word));

No matter what I use to setlocale ("el", "el_GR", "UTF-8" etc), it always returns "C" on the server. In any case, I would think UTF-8 should work for greek characters. So does that mean even UTF-8 is not supported on my server?

I noticed that, according to phpinfo(), the 'default_charset' is currently set to 'no value'. Could that be it? Should I change it to 'utf-8' or something similar?

Thanks again!

#8 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 01 September 2006 - 07:21 PM

I see "no value" as well for both Windows and Unix servers. I'm assuming this means use what the OS is using.

I know nothing about locales on Windows, and little about locales on Unix. According to the man pages for locale, locale -a lists all of the available locales. When I do this, I only see a dozen, none of which look like Greek. From searching the web, I found that the "C" locale you're seeing is also the "POSIX" locale, and based off of other information on this page, it looks like this is basically ASCII.

Do you know what OS your host is using? Can you run any commands to see its available locales?

I'm still trying to see if this can be done without a locale, using a code point range for the Greek chart.

#9 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 08:13 PM

the locale command doesn't exist on my host's shell. However, the common dir for locales contains the following:

ls /usr/share/locale/
C  POSIX  en_US  en_US.utf8

Some of those are empty dirs, but en_US.utf8, which is probably the best shot here, is not. However, when I use setlocale(LC_ALL, "en_US.utf8"); I still get "C"....

I can't believe my host only got ASCII!!!

thanks once again :D

#10 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 01 September 2006 - 08:30 PM

Although UTF-8 is appearing, I think the key is the en_ prefix; I found something that says el_GR.UTF-8 is the locale for Greek. However, I don't understand how these are "installed"--I don't see this available on my box. This seems related.

#11 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 01 September 2006 - 10:04 PM

Yes well, any UTF8 would probably be a good start, even the "installed" en_US.utf8. The problem is no matter what and how I call setlocale, when I print the locale set it always returns "C" (the posix thing)....

#12 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 03 September 2006 - 06:05 AM

This should print something if it works:

<?php
	echo setlocale(LC_ALL, 'en_US.UTF-8');
?>

If not, I'm assuming the locales you saw are not (properly) installed. Have you tried contacting your host?

#13 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 03 September 2006 - 07:29 PM

Yes, I did contact them and they installed a bunch of others too. Now, after the preg* is executed, the encoding doesn't change, and the string is there intact. But the preg* itself doesn't work, since it returns the whole string intact... :/

#14 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 03 September 2006 - 08:02 PM

Does setlocale work now that new installs have been made?

#15 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 04 September 2006 - 03:14 AM

If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C". Don't know if that even matters anymore. I also discovered the iconv* functions, played with them, tried every possible combo, still nothing :/

Here's what I get atm:

$word="ααα ββ";
echo mb_detect_encoding($word)." 1) $word<br/>";
$word = preg_replace("/(.*)\b(.)(.*)/u", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word<br/>";

$word="aaa bb";
echo mb_detect_encoding($word)." 1) $word<br/>";
$word = preg_replace("/(.*)\b(.)(.*)/", "$1__$2__$3", $word);
echo mb_detect_encoding($word)." 2) $word<br/>";

(The weird chars above are supposed to read $word = "ααα ββ"; )

UTF-8 1) ααα ββ
UTF-8 2) ααα ββ
ASCII 1) aaa bb
ASCII 2) aaa __b__b

Don't know what to think anymore. A few days more and a mental institution will be the only way for me :/
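One possible explanation for the Greek string coming back unchanged, offered as a guess: in PCRE builds where /u only switches on UTF-8 decoding and leaves \w and \b ASCII-only, there is no word boundary anywhere in "ααα ββ", so the pattern never matches. A sketch that avoids relying on \b by spelling out a Unicode-aware word start with lookarounds is shown here; it assumes PCRE was built with Unicode property support, and the function name mark_last_word_start is made up for illustration.

```php
<?php
// Mark the first character of the last word, mirroring the thread's
// /(.*)\b(.)(.*)/ test but with an explicit Unicode-aware word start.
function mark_last_word_start(string $s): ?string {
    $wordChar  = '[\p{L}\p{N}_]';
    // A word starts where the previous character is not a word character
    // (or there is none) and the next character is one.
    $wordStart = '(?<!' . $wordChar . ')(?=' . $wordChar . ')';
    return preg_replace('/(.*)' . $wordStart . '(.)(.*)/u', '$1__$2__$3', $s);
}

echo mark_last_word_start('aaa bb'), "\n"; // aaa __b__b
echo mark_last_word_start("\xCE\xB1\xCE\xB1\xCE\xB1 \xCE\xB2\xCE\xB2"), "\n"; // same shape for "ααα ββ"
```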

#16 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 04 September 2006 - 04:59 PM

If I just setlocale(LC_ALL "something") and then echo with NULL parameter, I still get "C".


You are using a comma after LC_ALL right? I take it the en_US.UTF-8 did not work? My ideas are exhausted until you can get setlocale to work. I think you need to go back and forth with your host, showing them your code.

The only other idea I have is to use Perl, because it supports \u which allows you to specify Unicode code points, thus establishing a range for the Greek block. I haven't had experience with this myself.

#17 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 07 September 2006 - 10:14 AM

Yup, I normally put the "," in, just forgot to add it this time...
UTF-8 didn't work :/

My host won't support basic stuff, showing them my code is extreme!

preg also supports the /u option, but so far it didn't do anything. I didn't do anything with greek blocks though. How do I do that? What exactly do you mean?

Thanks ;)

#18 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 07 September 2006 - 03:44 PM

Not /u the modifier, nor \u the uppercase escape, but \u the Unicode escape. I found this in Mastering Regular Expressions, but I cannot find it in the Perl docs; I'm a little puzzled. Basically, I was trying to suggest that you match a code point range if your locale will not work. If you look at the Greek and Coptic chart, it specifies a code point range of 0370-03FF. It would be nice if you could incorporate this into a regular expression with a simple [^\u0370-\u03FF]. Without being able to use code points or locales, your only other option (from my beginner's understanding of Unicode) would be to build an array of the UTF-8 byte sequences for all of these characters and use it in a regular expression. This is still a nasty solution. I'm going to continue fooling with this as I have time... You may want to consider looking for a host that is knowledgeable and supportive in this area.
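A sketch of the byte-level fallback, assuming the input is valid UTF-8. Rather than listing every character in an array, it uses the fact that the Greek and Coptic block (U+0370 to U+03FF) encodes as three contiguous two-byte ranges: CD B0..BF, CE 80..BF and CF 80..BF. The function name keep_greek_bytes is hypothetical. No /u modifier and no locale are needed.

```php
<?php
// Keep only characters from U+0370..U+03FF by matching their UTF-8 bytes.
function keep_greek_bytes(string $s): string {
    // \xCD[\xB0-\xBF]         covers U+0370..U+037F
    // [\xCE\xCF][\x80-\xBF]   covers U+0380..U+03FF
    preg_match_all('/\xCD[\xB0-\xBF]|[\xCE\xCF][\x80-\xBF]/', $s, $m);
    return implode('', $m[0]);
}

echo keep_greek_bytes("\xCE\xB1\xCE\xB1\xCE\xB1."), "\n"; // "ααα" without the period
```

Collecting the matches and joining them sidesteps the problem that a negated character class cannot express "not this multi-byte sequence" in byte mode.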

#19 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 07 September 2006 - 04:13 PM

Eureka! The /u that you mentioned allows the \x{...} to work. I also found this helpful. Run some more tests against this to see that it does what you want:

<meta charset="utf-8"/>
<pre>
<?php
	
	### To make sure we're not cached while testing.
	echo rand(), '<br/><br/>';
	### Create the "GREEK SMALL LETTER ALPHA" character.
	$alpha = pack("C*", 0xCE, 0xB1);
	### Create a string of 3 characters + a period.
	$string = $alpha . $alpha . $alpha . '.';
	### Show before.
	echo "string before >>>$string<<<";
	echo '<br/><br/>';
	### Run replace, showing what was found along the way.
	$string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
	### Show.
	echo "string after >>>$string<<<";
	
	function test ($matches) {
		array_shift($matches);
		print_r($matches);
		echo '<br/>';
		return '';
	}
?>
</pre>

I get the following output:

1557066832

string before >>>ααα.<<<

Array
(
    [0] => .
)

string after >>>ααα<<<
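For comparison, a sketch of two ways to write the same filter, assuming PCRE was built with Unicode property support (the build bundled with PHP is). The function names are made up for illustration. The block range keeps only U+0370 to U+03FF, so polytonic letters from the Greek Extended block (U+1F00 onward) are stripped; the \p{Greek} script property keeps them. Both variants also remove spaces, digits and punctuation.

```php
<?php
// Block range from the post above: keeps U+0370..U+03FF only.
function keep_greek_block(string $s): ?string {
    return preg_replace('/[^\x{0370}-\x{03FF}]/u', '', $s);
}

// Script property: keeps every character assigned to the Greek script.
function keep_greek_script(string $s): ?string {
    return preg_replace('/[^\p{Greek}]/u', '', $s);
}

$monotonic = "\xCE\xB1\xCE\xB1\xCE\xB1."; // "ααα."
$polytonic = "\xE1\xBC\x80\xCE\xB1";      // U+1F00 (alpha with psili) + alpha

echo keep_greek_block($monotonic), "\n";  // three alphas
echo keep_greek_block($polytonic), "\n";  // only the plain alpha survives
echo keep_greek_script($polytonic), "\n"; // both characters survive
```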


#20 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 07 September 2006 - 07:24 PM

Here's a test using the Greek and Armenian blocks. The first loop iteration shows the entire Greek set being run through the regex, where no characters are found/replaced. The second loop iteration shows a combination of the Greek and Armenian blocks being run through the regex, and you'll see that the entire Armenian block is removed.

<meta charset="utf-8"/>
<pre>
<?php

	$greek_block = array();
	foreach (range(880, 1023) as $code_point) {
		$greek_block[] = code2utf($code_point);
	}
	
	$armenian_block = array();
	foreach (range(1328, 1423) as $code_point) {
		$armenian_block[] = code2utf($code_point);
	}	
	
	$greek_string = join('', $greek_block);
	$greek_armenian_string = join('', $greek_block) . join('', $armenian_block);
	
	foreach (array($greek_string, $greek_armenian_string) as $string) {
		### Show before.
		echo "string before >>>$string<<<";
		echo '<br/><br/>';
		### Run replace, showing what was found along the way.
		$string = preg_replace_callback('/([^\x{0370}-\x{03FF}])/u', 'test', $string);
		### Show.
		echo "string after >>>$string<<<";
		echo '<hr/>';
	
	}
	
	function test ($matches) {
		array_shift($matches);
		print_r($matches);
		echo '<br/>';
		return '';
	}
	
	### Borrowed from http://us3.php.net/manual/en/function.utf8-encode.php#58461
	function code2utf($num) {
	   if($num<128)return chr($num);
	   if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
	   if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
	   if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
	   return '';
	}
?>
</pre>




