UTF-8 preg_replace returning null


Paperstyle


I'm working on a function to Anglicise a string. Here's part of it in a testing assembly:

(apologies for it not being in code tags, but it was converting the characters and making it less readable)

 

 

 

<html>
<head>
<title> splurd </title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<?php

function oooo($string) {
  $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo";  // nothing prints here
}

$ooooo = "$_POST[input]";
$ooooo = oooo($ooooo);

echo "<h1>$ooooo</h1>";  // nothing inside the header tags
echo ord($ooooo);  // 0
var_dump($ooooo);  // NULL

echo str_replace("Ố", "O", "ỐỐ");  // prints "OO"


$text="עברית מבולגנת";

function hebrewNotWordEndSwitch ($from, $to, $text) {
  $text=
    preg_replace('/'.$from.'([א-ת])/u','$2'.$to.'$1',$text);
  return $text;
}

do {
  $text_before=$text;
  $text=hebrewNotWordEndSwitch("ך","כ",$text);
  $text=hebrewNotWordEndSwitch("ם","מ",$text);
  $text=hebrewNotWordEndSwitch("ן","נ",$text);
  $text=hebrewNotWordEndSwitch("ף","פ",$text);
  $text=hebrewNotWordEndSwitch("ץ","צ",$text);
}  while ( $text_before!=$text );

print $text; // עברית מסודרת!

?>
</body>
</html>

 

 

 

 

I also took a function from http://au2.php.net/manual/en/function.preg-replace.php (the Hebrew one) to test whether UTF-8 patterns work at all here, and it does work.

 

So I send the following string to the script: ÒÓÔÕÖ. It returns null. I know that preg_replace returns null if the pattern is malformed UTF-8, but I can't see what's wrong with it.

 

Any help would be very much appreciated.

Thanks.

 

 

EDIT: I'm on FreeBSD 6.1 using PHP 5.2.5 (updated yesterday).
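
One thing that may be worth checking in the listing above: oooo() assigns the result of preg_replace() to $string but never returns it, and a PHP function that ends without a return statement hands NULL back to the caller. That alone would give the NULL seen in var_dump(), even if the pattern itself is fine. A minimal sketch, assuming a shortened character class; the names no_return() and anglicise_o() are made up for illustration:

```php
<?php
// Mirrors oooo() from the post: the result is assigned but never returned.
function no_return($string)
{
    $string = preg_replace('/[ÒÓÔÕÖØ]/u', 'O', $string);
    // Execution falls off the end here, so the caller receives NULL.
}

// Hypothetical corrected variant: returns the result and reports real PCRE errors.
function anglicise_o($string)
{
    $result = preg_replace('/[ÒÓÔÕÖØ]/u', 'O', $string);
    if ($result === null) {
        // preg_replace() returns NULL on error (for example malformed UTF-8 input).
        trigger_error('preg_replace failed, code ' . preg_last_error(), E_USER_WARNING);
        return $string;
    }
    return $result;
}

var_dump(no_return('ÒÓÔÕÖ'));   // NULL
var_dump(anglicise_o('ÒÓÔÕÖ')); // string(5) "OOOOO"
```

This assumes the script file is saved as UTF-8, as in the original post. It would also explain why "pungo" is never printed: the replacement succeeds, and the value is simply discarded.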

Link to comment
Share on other sites

I think you forgot to escape some parentheses here, at (O).

I don't think you can put a substring or subgroup inside a character class.

  $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo";  // nothing prints here

 

 

You also put a caret (^) in the middle of a character class, where it matches a literal ^. If you want to negate all the characters in the class, put the caret at the beginning of the class, i.e.:

[^negatethesechars]

 

 

The problem could also come from the form or script that runs the code. I noticed you are testing with a string posted from a form. The general rule of thumb is to keep all data in one encoding from end to end. That means having the browser submit UTF-8, forcing it to do so with HTML encoding headers, saving the HTML file and PHP script themselves as UTF-8, and so on. A discrepancy in any one of these could change the encoding of the data being passed around the script.
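
A quick way to test that rule of thumb is to check whether the posted string really is well-formed UTF-8 before running any /u pattern on it. The sketch below relies on preg_match() returning false when a /u subject is malformed; the helper name is_valid_utf8() is hypothetical:

```php
<?php
// Hypothetical helper: true only when $string is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for a malformed subject,
// while the empty pattern matches any valid string (result 1).
function is_valid_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("\xC3\x92")); // bool(true): Ò encoded as UTF-8
var_dump(is_valid_utf8("\xD2"));     // bool(false): Ò as a single ISO-8859-1 byte
```

In the form handler this could be called on $_POST['input'] before the replacement, so an encoding mismatch shows up as an explicit failure rather than a silent NULL.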

Link to comment
Share on other sites

Unfortunately, I had already set the encoding of both pages to the same thing: UTF-8.

 

On my side there's no caret in the middle of the character class. Are all the characters coming through properly (they're variations on 'O' with different linguistic add-ons)? Maybe they're encoded wrong and that's the problem, though they should be fine because I got them from the character map.

 

I tried putting parentheses in but it didn't work.

 

Thanks for your help, but are there any other ideas?

Link to comment
Share on other sites

So, just to make sure I've got this right: I can use the I18N_UnicodeNormalizer package, which effectively splits characters into their component parts, and then use a RegExp to remove the marks, which are matched by the \p{M} property code?

 

Oh, and I found on this page, http://www.unicode.org/unicode/reports/tr15/, that NFKD and NFKC will also decompose compatibility characters such as the fi ligature into f and i (the example is in the introduction).

 

Thanks a lot for your help.

 

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

 

Thanks.

Link to comment
Share on other sites

So, just to make sure I've got this right: I can use the I18N_UnicodeNormalizer package, which effectively splits characters into their component parts, and then use a RegExp to remove the marks, which are matched by the \p{M} property code?

 

Correct.
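
The decompose-then-strip approach confirmed here can be sketched roughly as follows. This is not the I18N_UnicodeNormalizer API: as a stand-in it uses the Normalizer class from PHP's intl extension when that is available, and the function names strip_marks() and anglicise() are made up for the example.

```php
<?php
// Remove combining marks (\p{M}) from text that is already decomposed.
function strip_marks($decomposed)
{
    $result = preg_replace('/\p{M}+/u', '', $decomposed);
    // preg_replace() returns NULL on error; fall back to the input in that case.
    return $result === null ? $decomposed : $result;
}

// Decompose with intl's Normalizer (assumption: the extension is installed),
// then strip the marks. Without intl the text is passed through undecomposed.
function anglicise($text)
{
    if (class_exists('Normalizer')) {
        $nfkd = Normalizer::normalize($text, Normalizer::FORM_KD);
        if ($nfkd !== false) {
            $text = $nfkd;
        }
    }
    return strip_marks($text);
}

// "O" followed by U+0300 (combining grave) and U+0303 (combining tilde).
echo strip_marks("O\xCC\x80\xCC\x83"), "\n"; // prints "O"
```

One caveat: characters such as Ø have no Unicode decomposition, so they pass through this unchanged and would still need an explicit replacement.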

 

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

 

Are you sure $string is in UTF-8? I'm not sure how PHP "understands" those characters that you've placed directly in the code. What if you convert that to a preg_match--does it detect the characters?

Link to comment
Share on other sites

echo "<br/>The encoding of the posted string is " . mb_detect_encoding($_POST['input'], "auto");  // UTF-8

 

function matchbox($string) {
$string = preg_match('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', $string) or print "spungo";
}
...
echo "<br/>Testing preg_match: " . matchbox($_POST['input']) . "<br/>";  // nothing printed from the function

 

I just thought then that it could be something about those particular characters, so I tested with some others:

 

echo "<br/>Testing other characters: " . preg_replace('/[ƁƂʙᴃᴮᴯḂḄḆ]/u', 'B', $_POST['input']) . "<br/>";

with the input string being ᴃᴮᴯḂ, and it converted all of them to 'B'.

 

I'll try replacing the original 'O' RegExp with hex-encoded values for the characters instead and post the results of that.
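
For reference, a minimal sketch of the hex idea using PCRE's \x{...} code point escapes, which are available with the /u modifier. Only a subset of the 'O' variants is listed, and the function name o_by_codepoint() is hypothetical:

```php
<?php
// Hypothetical helper: the same replacement with code points spelled out,
// so the pattern does not depend on how the source file is saved.
function o_by_codepoint($string)
{
    // U+00D2..U+00D6 (ÒÓÔÕÖ), U+00D8 (Ø), U+014C (Ō), U+014E (Ŏ), U+0150 (Ő)
    $pattern = '/[\x{00D2}-\x{00D6}\x{00D8}\x{014C}\x{014E}\x{0150}]/u';
    return preg_replace($pattern, 'O', $string);
}

// Ò and Ø given as raw UTF-8 bytes.
echo o_by_codepoint("\xC3\x92\xC3\x98"), "\n"; // prints "OO"
```

The pattern is single-quoted on purpose, so PHP leaves the backslashes alone and PCRE interprets the escapes itself.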

Link to comment
Share on other sites

So now I've written the string of 'O's as octal escapes, like so:

function oCodes($string) {
// In the above order
// \307\252 goes in between \307\221 and \307\254
$codes = "\303\222\303\223\303\224\303\225\303\226\303\227\303\228\303\229\303\230\305\214\305\216\305\220\306\237\306\240\307\221\307\254\307\276\310\214\310\216\310\252\310\254\310\256\310\260\312\230" . /* now begin the 3-octals, with ᴏ */ "\341\264\217\341\264\274\341\271\214\341\271\216\341\271\220\341\271\222\341\273\214\341\273\216\341\273\220\341\273\222\341\273\224\341\273\226\341\273\230\341\273\232\341\273\234\341\273\236\341\273\240\341\273\242";
return preg_replace("/[$codes]/u", 'O', $string);
}

 

This returned the following warning: Warning: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 14 in /usr/local/www/apache22/data/testformproc.php on line 18
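
A likely cause of this warning is the pair \303\228 and \303\229 in $codes. Octal digits only run from 0 to 7, so PHP reads "\228" as the byte \22 followed by a literal "8", which leaves \303 without a valid continuation byte. That falls around the seventh character of the class, which seems consistent with the reported offset. Ø is U+00D8, which is \303\230 in UTF-8 (and \303\227 is the multiplication sign ×, not an O). A small sketch to check such sequences; octal_escape() is a hypothetical helper:

```php
<?php
// "\228" is not one byte: PHP parses "\22" (byte 0x12) and then the literal "8".
var_dump(bin2hex("\303\228")); // string(6) "c31238"

// Hypothetical helper: print the octal escape for each byte of a UTF-8 string,
// so the sequences can be generated instead of counted by hand.
function octal_escape($utf8)
{
    $out = '';
    foreach (str_split($utf8) as $byte) {
        $out .= sprintf('\\%03o', ord($byte));
    }
    return $out;
}

echo octal_escape("\xC3\x98"), "\n"; // Ø as raw bytes; prints \303\230
```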

Link to comment
Share on other sites

What editor are you using and what character set/encoding are you saving the file in? The example below works for me by matching all of the characters; however, some of them would not work when copied--these would need to be packed. Why not use the other method?

 

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<pre>
<?php

$chars = utf8_encode('ÒÓÔÕÖØO');

function matchbox($string) {
  global $chars;
  $string = preg_match_all('/([' . preg_quote($chars) . '])/u', $string, $matches);
  print_r($matches);
}

if ($_POST) {
  print_r($_POST);
  matchbox($_POST['chars']);
}
else {
?>
<form method="post" action="<?php echo $_SERVER['PHP_SELF']; ?>">
<input type="text" name="chars" value="<?php echo $chars; ?>">
<input type="submit">
</form>
<?php
}
?>
</pre>

Link to comment
Share on other sites
