Paperstyle Posted April 1, 2008

I'm working on a function to Anglicise a string. Here's part of it in a testing assembly (apologies for it not being in code tags, but it was converting the characters and making it less readable):

    <html>
    <head>
    <title> splurd </title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    </head>
    <body>
    <?php
    function oooo($string)
    {
        $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here
    }

    $ooooo = "$_POST[input]";
    $ooooo = oooo($ooooo);
    echo "<h1>$ooooo</h1>"; // nothing inside the header tags
    echo ord($ooooo); // 0
    var_dump($ooooo); // NULL
    echo str_replace("Ố", "O", "ỐỐ"); // prints "OO"

    $text="עברית מבולגנת";
    function hebrewNotWordEndSwitch ($from, $to, $text)
    {
        $text= preg_replace('/'.$from.'([א-ת])/u','$2'.$to.'$1',$text);
        return $text;
    }

    do {
        $text_before=$text;
        $text=hebrewNotWordEndSwitch("ך","כ",$text);
        $text=hebrewNotWordEndSwitch("ם","מ",$text);
        $text=hebrewNotWordEndSwitch("ן","נ",$text);
        $text=hebrewNotWordEndSwitch("ף","פ",$text);
        $text=hebrewNotWordEndSwitch("ץ","צ",$text);
    } while ( $text_before!=$text );

    print $text; // עברית מסודרת!
    ?>
    </body>
    </html>

I also got a function from http://au2.php.net/manual/en/function.preg-replace.php to test if it was indeed working (the Hebrew one), and it works. So I send the following string to it: ÒÓÔÕÖ . It returns null. I know that if the pattern is malformed utf-8 the function returns null, but I can't see what's wrong with it. Any help would be very much appreciated. Thanks.

EDIT: I'm on FreeBSD 6.1 using PHP 5.2.5 (updated yesterday).
dsaba Posted April 1, 2008

I think you forgot to escape some parentheses here, at (O). I don't think you can put a substring or subgroup inside a character class.

    $string = /*@*/preg_replace('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', 'O', $string) or print "pungo"; // nothing prints here

You also put a caret sign ^ in the middle of a character class, where it will match ^ literally. If you want it to negate all the characters in the class, put it at the beginning of the character class, i.e. [^negatethesechars].

The problem could also come from the form/script that is running the code. I noticed you are testing with a string being posted from a form. The general rule of thumb is to keep all data in one uniform encoding so that it stays in that encoding. That means having your browser encode in utf-8 (forcing it to do so with HTML encoding headers), saving the actual HTML file/PHP script in utf-8, and so on. A discrepancy in any one of these could break the uniformity of the utf-8 data being passed around the script.
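To make the "uniform encoding" advice checkable, here is a minimal sketch of testing whether a posted value really is well-formed UTF-8 before it reaches the pattern. The helper name is_valid_utf8_sketch() is hypothetical and not from the thread; it relies on preg_match() returning false (rather than 0) when the /u modifier is given a malformed subject.

```php
<?php
// Hypothetical helper (not from the thread): true when $value is well-formed UTF-8.
function is_valid_utf8_sketch($value)
{
    // The empty pattern always matches, so a result of false can only mean
    // that PCRE rejected the subject, e.g. because of malformed UTF-8.
    return preg_match('//u', $value) !== false;
}

var_dump(is_valid_utf8_sketch("\xC3\x92")); // the two bytes of "Ò": bool(true)
var_dump(is_valid_utf8_sketch("\xC3"));     // a lead byte on its own: bool(false)
```

If the check fails for $_POST['input'], the browser or form is a more likely culprit than the pattern.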
Paperstyle Posted April 2, 2008

Unfortunately I set the encodings for both pages to be the same: utf-8. On my side there's no caret in the middle of the character class. Are all the characters coming through properly (they're variations on 'O' with different linguistic add-ons)? Maybe they're encoded wrong and that's the problem, though they should be fine because I got them from the character map. I tried putting parentheses in but it didn't work. Thanks for your help, but are there any other ideas?
effigy Posted April 2, 2008

How about this?
Paperstyle Posted April 3, 2008

So just so I make sure I've got this right: I can use the I18N_UnicodeNormalizer package, which will effectively split characters into their component parts, and then I use a RegExp to remove the marks, which are in the \p{M} property code?

Oh, and I found on this page, http://www.unicode.org/unicode/reports/tr15/, that NFKD and NFKC will decompose characters like the Æ into A and E (the example is in the introduction and uses the fi ligature character). Thanks a lot for your help.

Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it. Thanks.
effigy Posted April 3, 2008

Quoting Paperstyle: "So just so I make sure I've got this right: I can use I18N_UnicodeNormalizer package which will split characters effectively into their component parts, then I use a RegExp to remove the marks, which are in the \p{M} property code?"

Correct.

Quoting Paperstyle: "Just as a curiosity, can you see anything wrong with my original RegExp? I really can't understand it."

Are you sure $string is in UTF-8? I'm not sure how PHP "understands" those characters that you've placed directly in the code. What if you convert that to a preg_match--does it detect the characters?
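The decompose-then-strip approach confirmed above can be sketched roughly as follows. This is only an illustration under assumptions: it uses the Normalizer class from PHP's intl extension instead of the I18N_UnicodeNormalizer package named in the thread, it skips normalization when intl is not loaded, and strip_marks_sketch() is a hypothetical name.

```php
<?php
// Sketch only: Normalizer comes from the intl extension, which is an
// assumption here and not the package discussed in the thread.
function strip_marks_sketch($string)
{
    // Canonical decomposition turns e.g. "Ò" into "O" + U+0300.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $string = $decomposed;
        }
    }

    // Remove every combining mark (Unicode property M); /u enables UTF-8 mode.
    $result = preg_replace('/\p{M}+/u', '', $string);

    // preg_replace() returns NULL on error (e.g. malformed UTF-8), so fall back.
    return $result === null ? $string : $result;
}

// "O" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes.
echo strip_marks_sketch("O\xCC\x81"), "\n";
```

Letters such as Ø and Æ have no decomposition mapping in the Unicode data, so they pass through this sketch unchanged and would still need an explicit replacement.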
Paperstyle Posted April 4, 2008

    echo "<br/>The encoding of the posted string is " . mb_detect_encoding($_POST['input'], "auto"); // UTF-8

    function matchbox($string)
    {
        $string = preg_match('/[ÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰᴏᴼṌṎṐṒỌỎỒỔỖỘỚỜỞỠỢ]/u', $string) or print "spungo";
    }
    ...
    echo "<br/>Testing preg_match: " . matchbox($_POST['input']) . "<br/>"; // nothing printed from the function

I just thought then that it could be something about those particular characters, so I tested with some others:

    echo "<br/>Testing other characters: " . preg_replace('/[ƁƂʙᴃᴮᴯḂḄḆ]/u', 'B', $_POST['input']) . "<br/>";

with the input string being ᴃᴮᴯḂ, and it converted all of them to 'B'. I'll try replacing the original 'O' RegExp with hex-encoded values for the characters instead and post the results of that.
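One detail worth checking in the code as posted: neither oooo() nor matchbox() contains a return statement, whereas the 'B' test that works calls preg_replace() directly. A PHP function that ends without a return yields NULL to its caller, which may be enough to explain the NULL from var_dump() and the empty output. The sketch below uses hypothetical function names and \x{...} escapes so that it does not depend on how the file is saved.

```php
<?php
// Same shape as oooo() in the first post: the result is assigned to the
// local $string, but nothing is returned, so the caller receives NULL.
function replace_without_return($string)
{
    $string = preg_replace('/[\x{D2}-\x{D6}]/u', 'O', $string);
}

// Hypothetical corrected variant: hand the result back to the caller.
function replace_with_return($string)
{
    return preg_replace('/[\x{D2}-\x{D6}]/u', 'O', $string);
}

$input = "\xC3\x92\xC3\x93"; // "ÒÓ" as raw UTF-8 bytes

var_dump(replace_without_return($input)); // NULL
var_dump(replace_with_return($input));    // string(2) "OO"
```

The "pungo" and "spungo" messages never appear because the pattern itself matches; the value is simply lost when the function exits.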
Paperstyle Posted April 4, 2008

So now I've put the string of 'O's as octal-escaped, like so:

    function oCodes($string)
    {
        // In the above order
        // \307\252 goes in between \307\221 and \307\254
        $codes = "\303\222\303\223\303\224\303\225\303\226\303\227\303\228\303\229\303\230\305\214\305\216\305\220\306\237\306\240\307\221\307\254\307\276\310\214\310\216\310\252\310\254\310\256\310\260\312\230" .
            /* now begin the 3-octals, with ᴏ */
            "\341\264\217\341\264\274\341\271\214\341\271\216\341\271\220\341\271\222\341\273\214\341\273\216\341\273\220\341\273\222\341\273\224\341\273\226\341\273\230\341\273\232\341\273\234\341\273\236\341\273\240\341\273\242";
        return preg_replace("/[$codes]/u", 'O', $string);
    }

This returned the following warning:

    Warning: preg_replace() [function.preg-replace]: Compilation failed: invalid UTF-8 string at offset 14 in /usr/local/www/apache22/data/testformproc.php on line 18
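A likely source of that warning is the pair of escapes \228 and \229. Octal digits stop at 7, so inside a double-quoted string PHP reads "\228" as the byte \22 followed by a literal "8", which leaves the lead byte \303 without a valid continuation byte. The value after \227 in octal is \230. The sketch below shows the bytes involved; writing the class with PCRE code-point escapes is offered only as one possible way to avoid counting bytes by hand.

```php
<?php
// "\228" is not a single byte: PHP parses it as "\22" and then "8".
$intended = "\303\230"; // UTF-8 bytes C3 98, i.e. Ø (U+00D8)
$asPosted = "\303\228"; // bytes C3 12 38, which is not valid UTF-8

echo bin2hex($intended), "\n"; // c398
echo bin2hex($asPosted), "\n"; // c31238

// Assumption: naming code points directly sidesteps the octal arithmetic.
$pattern = '/[\x{D2}-\x{D6}\x{D8}]/u';
echo preg_replace($pattern, 'O', "\xC3\x92\xC3\x98"), "\n"; // OO
```

Note also that \303\227 encodes U+00D7 (the multiplication sign), which is not one of the letters in the original class.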
effigy Posted April 4, 2008

What editor are you using and what character set/encoding are you saving the file in? The example below works for me by matching all of the characters; however, some of them would not work when copied--these would need to be packed. Why not use the other method?

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <pre>
    <?php
    $chars = utf8_encode('ÒÓÔÕÖØO');

    function matchbox($string)
    {
        global $chars;
        $string = preg_match_all('/([' . preg_quote($chars) . '])/u', $string, $matches);
        print_r($matches);
    }

    if ($_POST) {
        print_r($_POST);
        matchbox($_POST['chars']);
    } else {
    ?>
    <form method="post" action="<?php echo $_SERVER['PHP_SELF']; ?>">
    <input type="text" name="chars" value="<?php echo $chars; ?>">
    <input type="submit">
    </form>
    <?php
    }
    ?>
    </pre>
Paperstyle Posted April 5, 2008

I will use the other method. I just thought there might be a simple explanation of why it wasn't working.

Editor: gedit
Encoding: UTF-8

Thanks plenty for all your help.