Paperstyle Posted April 1, 2008

I'm working on a function to Anglicise a string. Here's part of it in a testing assembly (apologies for it not being in code tags, but it was converting the characters and making it less readable):

<html>
<head>
<title> splurd </title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<?php
function oooo($string) {
    $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here
}

$ooooo = "$_POST[input]";
$ooooo = oooo($ooooo);
echo "<h1>$ooooo</h1>"; // nothing inside the header tags
echo ord($ooooo); // 0
var_dump($ooooo); // NULL
echo str_replace("Ố", "O", "ỐỐ"); // prints "OO"

$text="עברית מבולגנת";
function hebrewNotWordEndSwitch ($from, $to, $text) {
    $text= preg_replace('/'.$from.'([א-ת])/u','$2'.$to.'$1',$text);
    return $text;
}
do {
    $text_before=$text;
    $text=hebrewNotWordEndSwitch("ך","כ",$text);
    $text=hebrewNotWordEndSwitch("ם","מ",$text);
    $text=hebrewNotWordEndSwitch("ן","נ",$text);
    $text=hebrewNotWordEndSwitch("ף","פ",$text);
    $text=hebrewNotWordEndSwitch("ץ","צ",$text);
} while ( $text_before!=$text );
print $text; // עברית מסודרת!
?>
</body>
</html>

I also got a function from http://au2.php.net/manual/en/function.preg-replace.php to test if it was indeed working (the Hebrew one), and it works.

So I send the following string to it: ÒÓÔÕÖ. It returns null. I know that if the pattern is malformed utf-8 the function returns null, but I can't see what's wrong with it.

Any help would be very much appreciated. Thanks.

EDIT: I'm on FreeBSD 6.1 using PHP 5.2.5 (updated yesterday).

Link to comment: https://forums.phpfreaks.com/topic/98945-utf-8-preg_replace-returning-null/
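One possible reading of the symptoms above, offered as a sketch rather than a confirmed diagnosis: "pungo" never prints, which suggests preg_replace() is returning a non-empty string, yet oooo() has no return statement, and a PHP function that ends without one evaluates to NULL. The two function names and the shortened character class below are illustrative assumptions, not the original code; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical reduction of oooo() from the post, with a shortened character class.
function oooo_without_return($string) {
    $string = preg_replace('/[ÒÓÔÕÖØ]/u', 'O', $string);
    // No return statement here, so the caller receives NULL.
}

// Same body, but the result is handed back to the caller.
function oooo_with_return($string) {
    return preg_replace('/[ÒÓÔÕÖØ]/u', 'O', $string);
}

var_dump(oooo_without_return("ÒÓÔÕÖ")); // NULL
var_dump(oooo_with_return("ÒÓÔÕÖ"));    // string(5) "OOOOO"
```

If this is what is happening, the regular expression itself may be fine and the NULL would come from the assignment `$ooooo = oooo($ooooo);` rather than from preg_replace().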
dsaba Posted April 1, 2008

I think you forgot to escape some parentheses here at (O). I don't think you can put a substring or subgroup inside of a character class.

$string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here

You also put a caret sign ^ in the middle of a character class, which will match ^ literally, like a string literal. If you want it to negate all chars in the char class, put it at the beginning of the character class, i.e. [^negatethesechars].

The problem could also come from the form/script that is running the code. I noticed you are testing with a string being posted from a form. The general rule of thumb is to keep all data in one uniform encoding to ensure that it stays in that encoding. That means having your browser encode in utf-8, forcing your browser to do this with html encoding headers, saving the actual html file/php script in utf-8 encoding, etc. A discrepancy in any one of these could break the uniformity of the utf-8 data being passed around the script.
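The encoding point can be checked directly instead of assumed. A minimal sketch, assuming the mbstring extension is loaded (the thread later uses mb_detect_encoding(), so that seems likely); the helper name is_valid_utf8 is hypothetical. With the /u modifier, a subject that is not valid UTF-8 makes preg_replace() return NULL, and preg_last_error() reports PREG_BAD_UTF8_ERROR.

```php
<?php
// Hypothetical helper: true when the bytes form a valid UTF-8 sequence.
function is_valid_utf8($input) {
    return mb_check_encoding($input, 'UTF-8');
}

$good = "\xC3\x92"; // U+00D2 (Ò) encoded as UTF-8
$bad  = "\xC3";     // lead byte with no continuation byte

var_dump(is_valid_utf8($good)); // bool(true)
var_dump(is_valid_utf8($bad));  // bool(false)

// With /u, an invalid subject yields NULL plus a specific error code.
$result = preg_replace('/O/u', 'o', $bad);
if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "subject is not valid UTF-8\n";
}
```

Running the same check on `$_POST['input']` would show whether the form data arrives as UTF-8 before any regular expression is involved.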
Paperstyle Posted April 2, 2008

Unfortunately I set the encodings for both pages to be the same: utf-8. On my side there's no caret in the middle of the character class. Are all the characters coming through properly (they're variations on 'O' with different linguistic add-ons)? Maybe they're encoded wrong and that's the problem, though they should be fine because I got them from the character map. I tried putting parentheses in but it didn't work.

Thanks for your help, but are there any other ideas?
effigy Posted April 2, 2008

How about this?
Paperstyle Posted April 3, 2008

So just so I make sure I've got this right: I can use I18N_UnicodeNormalizer package which will split characters effectively into their component parts, then I use a RegExp to remove the marks, which are in the \p{M} property code?

Oh, and I found on this page here http://www.unicode.org/unicode/reports/tr15/ that the NFKD and NFKC will decompose characters like the Æ into A and E (the example is in the introduction and uses the fi ligature character).

Thanks a lot for your help. Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it.

Thanks.
effigy Posted April 3, 2008

"So just so I make sure I've got this right: I can use I18N_UnicodeNormalizer package which will split characters effectively into their component parts, then I use a RegExp to remove the marks, which are in the \p{M} property code?"

Correct.

"Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it."

Are you sure $string is in UTF-8? I'm not sure how PHP "understands" those characters that you've placed directly in the code. What if you convert that to a preg_match--does it detect the characters?
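The decompose-then-strip approach confirmed above can be sketched as follows. This is not the PEAR I18N_UnicodeNormalizer package named in the thread: it swaps in the Normalizer class from PHP's intl extension, which is an assumption about the environment (intl shipped with later PHP versions than the 5.2.5 mentioned here). The function name strip_marks is hypothetical.

```php
<?php
// Hypothetical helper: decompose to NFD, then drop every combining mark (\p{M}).
// Assumes the intl extension's Normalizer class, not the PEAR package from the thread.
function strip_marks($string) {
    $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
    if ($decomposed === false) {
        return null; // normalization failed, e.g. on invalid UTF-8 input
    }
    return preg_replace('/\p{M}/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    echo strip_marks("ÒÓÔÕÖ"), "\n"; // OOOOO
    echo strip_marks("Ø"), "\n";     // Ø (no decomposition, so it is left unchanged)
}
```

Letters such as Ø are single code points with no canonical decomposition, and as far as the Unicode data goes Æ has no decomposition either (unlike the fi ligature), so characters of that kind would still need an explicit mapping, for example with str_replace().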
Paperstyle Posted April 4, 2008

echo "<br/>The encoding of the posted string is " . mb_detect_encoding($_POST['input'], "auto"); // UTF-8

function matchbox($string) {
    $string = preg_match('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', $string) or print "spungo";
}
...
echo "<br/>Testing preg_match: " . matchbox($_POST['input']) . "<br/>"; // nothing printed from the function

I just thought then that it could be something about those particular characters, so I tested with some others:

echo "<br/>Testing other characters: " . preg_replace('/[ƁƂʙᴃᴮᴯḂḄḆ]/u', 'B', $_POST['input']) . "<br/>";

with the input string being ᴃᴮᴯḂ, and it converted all of them to 'B'.

I'll try replacing the original 'O' RegExp with hex-encoded values for the characters instead and post the results of that.
Paperstyle Posted April 4, 2008

So now I've put the string of 'O's as octal-escaped, like so:

function oCodes($string) {
    // In the above order
    // \307\252 goes in between \307\221 and \307\254
    $codes = "\303\222\303\223\303\224\303\225\303\226\303\227\303\228\303\229\303\230\305\214\305\216\305\220\306\237\306\240\307\221\307\254\307\276\310\214\310\216\310\252\310\254\310\256\310\260\312\230" .
        /* now begin the 3-octals, with ᴏ */
        "\341\264\217\341\264\274\341\271\214\341\271\216\341\271\220\341\271\222\341\273\214\341\273\216\341\273\220\341\273\222\341\273\224\341\273\226\341\273\230\341\273\232\341\273\234\341\273\236\341\273\240\341\273\242";
    return preg_replace("/[$codes]/u", 'O', $string);
}

This returned the following warning:

Warning: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 14 in /usr/local/www/apache22/data/testformproc.php on line 18
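A likely source of that warning, sketched under the assumption that the string is exactly as posted: octal escapes only use the digits 0 to 7, so in a double-quoted PHP string "\228" is read as the escape "\22" followed by a literal "8". The byte after \303 is then not a UTF-8 continuation byte. The variable names below are illustrative only.

```php
<?php
$valid     = "\303\230"; // bytes C3 98: U+00D8 (Ø)
$as_posted = "\303\228"; // "\22" is octal 22 (byte 0x12), followed by a literal "8"

echo bin2hex($valid), "\n";     // c398
echo bin2hex($as_posted), "\n"; // c31238
var_dump(mb_check_encoding($as_posted, 'UTF-8')); // bool(false)

// One alternative: name the code points and let PCRE handle the encoding.
// Single quotes keep \x{...} intact so the regex engine sees the escape itself.
echo preg_replace('/[\x{D2}-\x{D6}\x{D8}]/u', 'O', "ÒÓÔÕÖØ"), "\n"; // OOOOOO
```

Writing the class with `\x{...}` code point escapes avoids hand-computing byte sequences, and a range such as `\x{D2}-\x{D6}` also skips U+00D7 (×), which the octal list above includes as \303\227.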
effigy Posted April 4, 2008

What editor are you using and what character set/encoding are you saving the file in? The example below works for me by matching all of the characters; however, some of them would not work when copied--these would need to be packed. Why not use the other method?

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<pre>
<?php
$chars = utf8_encode('ÒÓÔÕÖØO');
function matchbox($string) {
    global $chars;
    $string = preg_match_all('/([' . preg_quote($chars) . '])/u', $string, $matches);
    print_r($matches);
}
if ($_POST) {
    print_r($_POST);
    matchbox($_POST['chars']);
}
else {
?>
<form method="post" action="<?php echo $_SERVER['PHP_SELF']; ?>">
<input type="text" name="chars" value="<?php echo $chars; ?>">
<input type="submit">
</form>
<?php
}
?>
</pre>
Paperstyle Posted April 5, 2008

I will use the other method. I just thought there might be a simple explanation of why it wasn't working.

Editor: gedit
Encoding: UTF-8

Thanks plenty for all your help.