UTF-8 preg_replace returning null

Paperstyle · April 1, 2008

I'm working on a function to Anglicise a string. Here's part of it in a test page:

(apologies for it not being in code tags, but it was converting the characters and making it less readable)

<html>
<head>
<title> splurd </title>
</head>
<body>
<?php
function oooo($string) {
    $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here
}

$ooooo = "$_POST[input]";
$ooooo = oooo($ooooo);
echo "<h1>$ooooo</h1>"; // nothing inside the header tags
echo ord($ooooo); // 0
var_dump($ooooo); // NULL
echo str_replace("Ố", "O", "ỐỐ"); // prints "OO"

$text="עברית מבולגנת";

function hebrewNotWordEndSwitch ($from, $to, $text) {
    $text=preg_replace('/'.$from.'([א-ת])/u','$2'.$to.'$1',$text);
    return $text;
}

do {
    $text_before=$text;
    $text=hebrewNotWordEndSwitch("ך","כ",$text);
    $text=hebrewNotWordEndSwitch("ם","מ",$text);
    $text=hebrewNotWordEndSwitch("ן","נ",$text);
    $text=hebrewNotWordEndSwitch("ף","פ",$text);
    $text=hebrewNotWordEndSwitch("ץ","צ",$text);
} while ( $text_before!=$text );

print $text; // עברית מסודרת!
?>
</body>
</html>

I also took a function (the Hebrew one) from http://au2.php.net/manual/en/function.preg-replace.php to check that UTF-8 patterns work at all, and it does work.

So I send the string ÒÓÔÕÖ to my function, and it returns null. I know that preg_replace returns null if the pattern is malformed UTF-8, but I can't see what's wrong with it.

Any help would be very much appreciated.

Thanks.

EDIT: I'm on FreeBSD 6.1 using PHP 5.2.5 (updated yesterday).

dsaba · April 1, 2008

I think you forgot to escape some parentheses here, at (O).

I don't think you can put a substring or subgroup inside of a character class.

$string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here

You also put a caret (^) in the middle of a character class, where it matches a literal ^. If you want to negate all the characters in the class, put the caret at the beginning of the class, i.e.:

[^negatethesechars]

The problem could also come from the form or script that runs the code. I noticed you are testing with a string posted from a form. The general rule of thumb is to keep all data in one encoding from end to end. That means having the browser submit UTF-8, declaring UTF-8 in the HTTP headers and HTML meta tags, saving the HTML file and the PHP script themselves as UTF-8, and so on. A discrepancy in any one of these can change the encoding of the data as it passes through the script.
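To make the "one encoding everywhere" advice concrete, here is a minimal sketch, assuming a single PHP page that both serves and processes the form. The helper name render_form is made up for illustration, and the field name input is borrowed from the posted script; nothing here is the original poster's actual form.

```php
<?php
// Hypothetical helper: builds the test form with its encoding stated explicitly.
function render_form(string $value = ''): string
{
    // Escape as UTF-8 so multi-byte characters are echoed back unchanged.
    $safe = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

    return '<form method="post" accept-charset="utf-8">'
         . '<input type="text" name="input" value="' . $safe . '">'
         . '<input type="submit">'
         . '</form>';
}

// Declare the response encoding before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_form($_POST['input'] ?? '');
```

The script file itself still has to be saved as UTF-8 in the editor; no header can fix literals that were stored in another encoding.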

Paperstyle · April 2, 2008

Unfortunately, I've already set the encoding of both pages to the same thing: UTF-8.

On my side there's no caret in the middle of the character class. Are all the characters coming through properly (they're variations on 'O' with different linguistic add-ons)? Maybe they're encoded wrong and that's the problem, though they should be fine because I got them from the character map.

I tried putting parentheses in but it didn't work.

Thanks for your help, but are there any other ideas?

effigy · April 2, 2008

How about this?

Paperstyle · April 3, 2008

So just so I make sure I've got this right: I can use I18N_UnicodeNormalizer package which will split characters effectively into their component parts, then I use a RegExp to remove the marks, which are in the \p{M} property code?

Oh, and I found on this page, http://www.unicode.org/unicode/reports/tr15/, that NFKD and NFKC also decompose compatibility characters such as the ﬁ ligature (the example in the introduction) into their plain letters, though Æ itself has no decomposition and would still need its own replacement.

Thanks a lot for your help.

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

Thanks.

effigy · April 3, 2008

"So just so I make sure I've got this right: I can use I18N_UnicodeNormalizer package which will split characters effectively into their component parts, then I use a RegExp to remove the marks, which are in the \p{M} property code?"

Correct.
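For anyone following along, the decompose-then-strip idea might look roughly like this. It is only a sketch: it swaps in the Normalizer class from PHP's intl extension rather than the PEAR I18N_UnicodeNormalizer package discussed here, and the function name strip_marks is made up.

```php
<?php
// Sketch only: uses the intl extension's Normalizer class instead of the
// PEAR I18N_UnicodeNormalizer package mentioned in the thread.
function strip_marks(string $text, bool $compat = false): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // NFD splits "Ò" into "O" + U+0300; NFKD additionally unfolds
    // compatibility characters such as the "ﬁ" ligature.
    $form = $compat ? Normalizer::FORM_KD : Normalizer::FORM_D;
    $decomposed = Normalizer::normalize($text, $form);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // \p{M} matches the combining marks left behind by decomposition.
    return preg_replace('/\p{M}+/u', '', $decomposed);
}
```

With this, strip_marks('ÒÓÔÕÖ') should come back as 'OOOOO'. Letters such as Ø have no decomposition, so they pass through unchanged and would still need an explicit replacement.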

"Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it."

Are you sure $string is in UTF-8? I'm not sure how PHP "understands" those characters that you've placed directly in the code. What if you convert that to a preg_match--does it detect the characters?
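One way to answer that question in code is to let PCRE say whether it rejected the subject or simply found nothing. This is a rough sketch; the function name probe_utf8_match is hypothetical and the character class is shortened to five letters.

```php
<?php
// Hypothetical diagnostic: reports whether a /u match failed because the
// subject is not valid UTF-8, or simply because nothing matched.
function probe_utf8_match(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);

    if ($result === false) {
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'subject is not valid UTF-8'
            : 'other PCRE error';
    }

    return $result === 1 ? 'match' : 'no match';
}

echo probe_utf8_match('/[ÒÓÔÕÖ]/u', "\xC3\x92"), "\n"; // "Ò" encoded as UTF-8
echo probe_utf8_match('/[ÒÓÔÕÖ]/u', "\xD2"), "\n";     // "Ò" encoded as ISO-8859-1
```

This only checks the subject string. A pattern saved in the wrong encoding fails at compile time instead, with a warning like the "Compilation failed" one.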

Paperstyle · April 4, 2008

echo "<br/>The encoding of the posted string is " . mb_detect_encoding($_POST['input'], "auto");  // UTF-8

function matchbox($string) {
$string = preg_match('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', $string) or print "spungo";
}
...
echo "<br/>Testing preg_match: " . matchbox($_POST['input']) . "<br/>";  // nothing printed from the function

I just thought then that it could be something about those particular characters, so I tested with some others:

echo "<br/>Testing other characters: " . preg_replace('/[ƁƂʙᴃᴮᴯḂḄḆ]/u', 'B', $_POST['input']) . "<br/>";

with the input string being ᴃᴮᴯḂ, and it converted all of them to 'B'.

I'll try replacing the original 'O' RegExp with hex-encoded values for the characters instead and post the results of that.
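One detail in the functions posted so far may account for the NULL on its own: neither oooo() nor matchbox() contains a return statement, and a PHP function that ends without one evaluates to NULL whatever happened inside it. The 'B' test calls preg_replace() directly rather than through such a function, which would explain why it behaves. A minimal sketch, assuming that is the cause (the name anglicise_o is made up here, and the character class is shortened):

```php
<?php
function anglicise_o(string $string): string
{
    $result = preg_replace('/[ÒÓÔÕÖØŌŎŐ]/u', 'O', $string);

    if ($result === null) {
        // preg_replace() yields NULL on failure, e.g. malformed UTF-8 input.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $result; // the posted oooo() has no equivalent of this line
}

var_dump(anglicise_o('ÒÓÔÕÖ'));
```

Checking for null explicitly also avoids the `or print` idiom, which would fire on a legitimately empty result as well as on an error.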

Paperstyle · April 4, 2008

So now I've put the string of 'O's as octal-escaped, like so:

function oCodes($string) {
// In the above order
// \307\252 goes in between \307\221 and \307\254
$codes = "\303\222\303\223\303\224\303\225\303\226\303\227\303\228\303\229\303\230\305\214\305\216\305\220\306\237\306\240\307\221\307\254\307\276\310\214\310\216\310\252\310\254\310\256\310\260\312\230" . /* now begin the 3-octals, with ᴏ */ "\341\264\217\341\264\274\341\271\214\341\271\216\341\271\220\341\271\222\341\273\214\341\273\216\341\273\220\341\273\222\341\273\224\341\273\226\341\273\230\341\273\232\341\273\234\341\273\236\341\273\240\341\273\242";
return preg_replace("/[$codes]/u", 'O', $string);
}

This returned the following warning: Warning: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 14 in /usr/local/www/apache22/data/testformproc.php on line 18
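The offset in that warning seems to line up with the escapes themselves. PHP octal escapes only take the digits 0 to 7, so "\228" is read as "\22" followed by a literal 8, which leaves the lead byte \303 without a continuation byte. The run \227, \228, \229, \230 also looks as if it was counted in decimal: in octal, Ø (U+00D8) is \303\230, and \303\227 is the multiplication sign ×. A sketch that sidesteps byte counting by using PCRE's \x{...} code point escapes (the function name o_class_by_codepoint is hypothetical, and only a few of the characters are listed):

```php
<?php
// "\228" is not one byte: octal digits stop at 7, so this is "\22" plus "8".
echo bin2hex("\303\228"), "\n"; // c31238
echo bin2hex("\303\230"), "\n"; // c398, the UTF-8 bytes of Ø

function o_class_by_codepoint(string $string): ?string
{
    // Single quotes keep \x{...} intact for PCRE; with /u each escape is
    // read as a code point: Ò-Ö, Ø, Ō and Ọ respectively.
    return preg_replace('/[\x{00D2}-\x{00D6}\x{00D8}\x{014C}\x{1ECC}]/u', 'O', $string);
}
```

Because the pattern is plain ASCII, it no longer depends on the encoding the script file was saved in.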

effigy · April 4, 2008

What editor are you using and what character set/encoding are you saving the file in? The example below works for me by matching all of the characters; however, some of them would not work when copied--these would need to be packed. Why not use the other method?

<pre>
<?php
$chars = utf8_encode('ÒÓÔÕÖØO');

function matchbox($string) {
    global $chars;
    $string = preg_match_all('/([' . preg_quote($chars) . '])/u', $string, $matches);
    print_r($matches);
}

if ($_POST) {
    print_r($_POST);
    matchbox($_POST['chars']);
}
else {
?>
</form>
<?php
}
?>
</pre>

Paperstyle · April 5, 2008

I will use the other method. I just thought there might be a simple explanation of why it wasn't working.

Editor: gedit

Encoding: UTF-8

Thanks plenty for all your help.
