
UTF-8 preg_replace returning null


Paperstyle


I'm working on a function to Anglicise a string. Here's part of it in a testing assembly:

(apologies for it not being in code tags, but it was converting the characters and making it less readable)

 

 

 

<html>
<head>
<title> splurd </title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<?php

function oooo($string) {
$string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo";  // nothing prints here
}

$ooooo = "$_POST[input]";
$ooooo = oooo($ooooo);

echo "<h1>$ooooo</h1>";  // nothing inside the header tags
echo ord($ooooo);  // 0
var_dump($ooooo);  // NULL

echo str_replace("Ố", "O", "ỐỐ");  // prints "OO"

$text="עברית מבולגנת";

function hebrewNotWordEndSwitch ($from, $to, $text) {
  $text=
    preg_replace('/'.$from.'([א-ת])/u','$2'.$to.'$1',$text);
  return $text;
}

do {
  $text_before=$text;
  $text=hebrewNotWordEndSwitch("ך","כ",$text);
  $text=hebrewNotWordEndSwitch("ם","מ",$text);
  $text=hebrewNotWordEndSwitch("ן","נ",$text);
  $text=hebrewNotWordEndSwitch("ף","פ",$text);
  $text=hebrewNotWordEndSwitch("ץ","צ",$text);
}  while ( $text_before!=$text );

print $text; // עברית מסודרת!

?>
</body>
</html>

 

 

 

 

I also took a function from http://au2.php.net/manual/en/function.preg-replace.php (the Hebrew one) to test whether a UTF-8 preg_replace works here at all, and it does.

 

When I send the string ÒÓÔÕÖ to oooo(), I get null back. I know that preg_replace returns null if the pattern is malformed utf-8, but I can't see what's wrong with it.

 

Any help would be very much appreciated.

Thanks.

 

 

EDIT: I'm on FreeBSD 6.1 using PHP 5.2.5 (updated yesterday).
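
One thing worth ruling out before blaming the pattern: as posted, oooo() assigns the result of preg_replace() to its local $string but never returns it, and a PHP function with no return statement evaluates to NULL. That alone would give the empty <h1>, the 0 from ord() and the NULL from var_dump(), whatever the regex does. Below is a minimal sketch of the same function with an explicit return. The name anglicise_o and the shortened character class are illustrative, not from the post, and the file is assumed to be saved as UTF-8.

```php
<?php
// Illustrative variant of oooo(); the name and the shortened class are assumptions.
function anglicise_o($string) {
    $result = preg_replace('/[ÒÓÔÕÖØ]/u', 'O', $string);
    if ($result === null) {
        // preg_replace() returns NULL when it fails, e.g. on invalid UTF-8.
        return $string;
    }
    return $result;  // without this line the caller receives NULL
}

var_dump(anglicise_o("ÒÓÔÕÖ"));  // string(5) "OOOOO"
```

Since assignment binds tighter than `or`, the `or print "pungo"` tail only fires when the assigned result is falsy (NULL, "" or "0"). That "pungo" never printed suggests preg_replace() itself returned a non-empty string.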


I think you forgot to escape some parentheses here, at (O).

I don't think you can put a substring or subgroup inside a character class.

  $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo";  // nothing prints here

 

 

You also put a caret sign ^ in the middle of a character class, where it matches ^ literally, like any other character. If you want it to negate all the characters in the class, put it at the beginning of the character class, i.e.:

[^negatethesechars]

 

 

The problem could also come from the form or script that runs the code. I noticed you are testing with a string posted from a form. As a general rule of thumb, keep all data in one uniform encoding from end to end. That means having the browser submit in utf-8 (forcing it with HTML encoding headers), saving the actual HTML file/PHP script in utf-8, and so on. A discrepancy in any one of these could change the encoding of the utf-8 data being passed around the script.
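
One way to apply that rule of thumb at the point where the data arrives is to test whether the posted bytes are well-formed UTF-8 before running any /u pattern over them. A minimal sketch, assuming PCRE was built with UTF-8 support; the helper name is_valid_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: an empty pattern with the /u modifier makes PCRE
// validate the subject as UTF-8. preg_match() returns 1 when the subject is
// well-formed and FALSE when it is not.
function is_valid_utf8($bytes) {
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("\xC3\x92"));  // bool(true)   Ò encoded as UTF-8
var_dump(is_valid_utf8("\xD2"));      // bool(false)  Ò encoded as ISO-8859-1
```

Where the mbstring extension is loaded, mb_check_encoding($bytes, 'UTF-8') is an alternative.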

Unfortunately that isn't it: I had already set the encodings for both pages to be the same, utf-8.

 

On my side there's no caret in the middle of the character class. Are all the characters coming through properly (they're variations on 'O' with different linguistic add-ons)? Maybe they're encoded wrong and that's the problem, though they should be fine because I got them from the character map.

 

I tried putting parentheses in but it didn't work.

 

Thanks for your help, but are there any other ideas?

So, just to make sure I've got this right: I can use the I18N_UnicodeNormalizer package, which effectively splits characters into their component parts, and then use a RegExp to remove the marks, which are covered by the \p{M} property code?

 

Oh, and I found on this page here http://www.unicode.org/unicode/reports/tr15/ that the NFKD and NFKC will decompose characters like the Æ into A and E (the example is in the introduction and uses the fi ligature character).

 

Thanks a lot for your help.

 

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

 

Thanks.

So, just to make sure I've got this right: I can use the I18N_UnicodeNormalizer package, which effectively splits characters into their component parts, and then use a RegExp to remove the marks, which are covered by the \p{M} property code?

 

Correct.
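
As a rough sketch of that two-step approach: decompose, then delete everything in \p{M}. The thread discusses the PEAR I18N_UnicodeNormalizer package, whose API is not shown here. This sketch instead assumes PHP's intl extension and its Normalizer class, and simply skips the decomposition step when that class is missing. The function name strip_marks is hypothetical.

```php
<?php
// Hypothetical helper. Assumes PCRE with Unicode property support (\p{M}).
function strip_marks($string) {
    if (class_exists('Normalizer')) {
        // intl extension (an assumption): canonical decomposition, NFD.
        $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $string = $decomposed;
        }
    }
    // Remove the combining marks left behind by the decomposition.
    $stripped = preg_replace('/\p{M}+/u', '', $string);
    return $stripped === null ? $string : $stripped;
}

// "O" followed by U+0302 (circumflex) and U+0301 (acute): a decomposed Ố.
var_dump(strip_marks("O\xCC\x82\xCC\x81"));  // string(1) "O"
```

Characters that have no decomposition mapping in the Unicode data, such as Ø and Æ, pass through normalization unchanged, so they would still need an explicit replacement table.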

 

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

 

Are you sure $string is in UTF-8? I'm not sure how PHP "understands" those characters that you've placed directly in the code. What if you convert that to a preg_match--does it detect the characters?

echo "<br/>The encoding of the posted string is " . mb_detect_encoding($_POST['input'], "auto");  // UTF-8

 

function matchbox($string) {
$string = preg_match('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', $string) or print "spungo";
}
...
echo "<br/>Testing preg_match: " . matchbox($_POST['input']) . "<br/>";  // nothing printed from the function
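
Note that matchbox() as posted has no return statement, so the concatenation on the last line always receives NULL. "spungo" not printing suggests preg_match() returned 1, i.e. the pattern matched. Here is a sketch that returns the outcome and separates "no match" from "error" using preg_last_error() (available from PHP 5.2.0). The name match_status and its return strings are hypothetical.

```php
<?php
// Hypothetical helper: reports what preg_match() actually did.
function match_status($pattern, $subject) {
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'bad utf-8' : 'error';
    }
    return $result === 1 ? 'match' : 'no match';
}

// \x{00D2}-\x{00D6} is PCRE's code point notation for the range Ò to Ö.
$pattern = '/[\x{00D2}-\x{00D6}]/u';
echo match_status($pattern, "\xC3\x92"), "\n";  // match      (Ò as UTF-8)
echo match_status($pattern, "abc"), "\n";       // no match
echo match_status($pattern, "\xD2"), "\n";      // bad utf-8  (Ò as ISO-8859-1)
```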

 

I just thought then that it could be something about those particular characters, so I tested with some others:

 

echo "<br/>Testing other characters: " . preg_replace('/[ƁƂʙᴃᴮᴯḂḄḆ]/u', 'B', $_POST['input']) . "<br/>";

with the input string being ᴃᴮᴯḂ, and it converted all of them to 'B'.

 

I'll try replacing the original 'O' RegExp with hex-encoded values for the characters instead and post the results of that.
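
For the hex route, PCRE accepts code points written as \x{HHHH} when the /u modifier is set, which keeps the pattern independent of how the editor saved the source file. A minimal sketch covering only the first few characters of the list (Ò to Ö, Ø, Ō, Ŏ, Ő); the function name o_variants_to_ascii is hypothetical:

```php
<?php
function o_variants_to_ascii($string) {
    // Single quotes: PHP passes \x{...} through untouched and PCRE decodes it.
    $pattern = '/[\x{00D2}-\x{00D6}\x{00D8}\x{014C}\x{014E}\x{0150}]/u';
    $result = preg_replace($pattern, 'O', $string);
    return $result === null ? $string : $result;
}

// Input built from explicit UTF-8 bytes: Ò (C3 92), Ø (C3 98), Ő (C5 90).
var_dump(o_variants_to_ascii("\xC3\x92\xC3\x98\xC5\x90"));  // string(3) "OOO"
```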

So now I've put the string of 'O's as octal-escaped, like so:

function oCodes($string) {
// In the above order
// \307\252 goes in between \307\221 and \307\254
$codes = "\303\222\303\223\303\224\303\225\303\226\303\227\303\228\303\229\303\230\305\214\305\216\305\220\306\237\306\240\307\221\307\254\307\276\310\214\310\216\310\252\310\254\310\256\310\260\312\230" . /* now begin the 3-octals, with ᴏ */ "\341\264\217\341\264\274\341\271\214\341\271\216\341\271\220\341\271\222\341\273\214\341\273\216\341\273\220\341\273\222\341\273\224\341\273\226\341\273\230\341\273\232\341\273\234\341\273\236\341\273\240\341\273\242";
return preg_replace("/[$codes]/u", 'O', $string);
}

 

This returned the following warning: Warning: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 14 in /usr/local/www/apache22/data/testformproc.php on line 18
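
A likely cause of that warning is the pair of escapes \303\228 and \303\229. Octal digits only run from 0 to 7, so PHP reads "\228" as the byte \22 followed by a literal "8", and \22 is not a valid continuation byte after the lead byte \303. That stray byte sits at offset 14 of the pattern ("[" plus six two-byte characters, then \303), which is consistent with the message. Also note that \303\227 is × (U+00D7, the multiplication sign); Ø is \303\230. The sketch below shows the parsing and a way to generate the escapes rather than counting them by hand. The helper name octal_escapes is hypothetical.

```php
<?php
// "\228" is not one byte: PHP stops the octal escape at the first non-octal digit.
var_dump(bin2hex("\303\228"));  // string(6) "c31238"

// Hypothetical helper; assumes $char is already UTF-8 encoded.
function octal_escapes($char) {
    $out = '';
    foreach (str_split($char) as $byte) {
        $out .= sprintf('\\%o', ord($byte));
    }
    return $out;
}

echo octal_escapes("\xC3\x98"), "\n";  // \303\230  (Ø, U+00D8)
echo octal_escapes("\xC3\x97"), "\n";  // \303\227  (×, U+00D7)
```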

What editor are you using and what character set/encoding are you saving the file in? The example below works for me by matching all of the characters; however, some of them would not work when copied--these would need to be packed. Why not use the other method?

 

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<pre>
<?php

$chars = utf8_encode('ÒÓÔÕÖØO');

function matchbox($string) {
    global $chars;
    $string = preg_match_all('/([' . preg_quote($chars) . '])/u', $string, $matches);
    print_r($matches);
}

if ($_POST) {
    print_r($_POST);
    matchbox($_POST['chars']);
}
else {
?>
<form method="post" action="<?php echo $_SERVER['PHP_SELF']; ?>">
<input type="text" name="chars" value="<?php echo $chars; ?>">
<input type="submit">
</form>
<?php
}
?>
</pre>

Archived

This topic is now archived and is closed to further replies.
