effigy Posted October 22, 2007
Actually, now I think about it I had pre tags in the example. Go to Tools->Options->Content->Advanced and see what's listed under "Monospace." You may want to try removing the pre tags to see if the results differ.
mafro Posted October 22, 2007 Author
Just been playing with I18N_UnicodeNormalizer. If I do a toNFC() with a charset of UTF-8 then I get the correct output.

After reading from the disk:

0000000   B   j   o 314 210   r   k  \n
        6a42 cc6f 7288 0a6b
0000010

After using the I18N_UnicodeNormalizer class:

0000000   B   j 303 266   r   k  \n
        6a42 b6c3 6b72 000a
0000007

So there's a solution. Nice. If I set my encoding to ISO-8859-1, however, I'm getting the same output as the input: it doesn't convert the o+314+210 into a 366. Can't have it all tho, I suppose?!
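To make the two dumps above concrete, here is a minimal sketch that reproduces the same bytes in PHP. The function `composeDiaeresisSketch` and its three-entry table are hypothetical and only stand in for what `toNFC()` did to this one filename; it is not a Unicode normalizer and does not use the I18N_UnicodeNormalizer class.

```php
<?php
// Illustration only: a tiny lookup table standing in for a real NFC normalizer.
// It handles just three base letters followed by U+0308 (combining diaeresis).
function composeDiaeresisSketch(string $utf8): string
{
    $table = array(
        "a\xCC\x88" => "\xC3\xA4", // a + U+0308 -> U+00E4
        "o\xCC\x88" => "\xC3\xB6", // o + U+0308 -> U+00F6
        "u\xCC\x88" => "\xC3\xBC", // u + U+0308 -> U+00FC
    );
    return strtr($utf8, $table);
}

$decomposed = "Bjo\xCC\x88rk";                 // the bytes read from the OSX disk (octal 314 210)
$composed   = composeDiaeresisSketch($decomposed);

echo bin2hex($decomposed), "\n";               // 426a6fcc88726b
echo bin2hex($composed), "\n";                 // 426ac3b6726b
```

Octal 314 210 is hex CC 88 (the combining diaeresis in UTF-8), and octal 303 266 is hex C3 B6 (the precomposed ö), which matches the second dump.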
mafro Posted October 22, 2007 Author
Quoting effigy: "Actually, now I think about it I had pre tags in the example. Go to Tools->Options->Content->Advanced and see what's listed under "Monospace." You may want to try removing the pre tags to see if the results differ."
It looks the same without the pre tags. And checking the monospace font was what I had done: Courier it is.
effigy Posted October 22, 2007
Quoting mafro: "If I set my encoding to ISO-8859-1, I'm getting the same output as the input"
I'm confused. Why are you trying to use ISO-8859-1? Shouldn't everything be in UTF-8?
mafro Posted October 22, 2007 Author
Well, Debian and Windows work as standard, reading filenames off the disk and displaying them in iso-8859-1. No conversion required. So it would be nice if I could have a minimal conversion routine for OSX, and then just use iso-8859-1. There's potential for these filenames (paths) being stored in a database, so a uniform charset across the board is good. What d'you think? Just use UTF-8 on all platforms? To be honest, it'd be nice to support OSX but I'm not going to create myself loads of extra work. I'm just glad we've found a workaround for future reference.
effigy Posted October 22, 2007
Quoting mafro: "Well, Debian and Windows work as standard, reading filenames off the disk and displaying them in iso-8859-1."
I'm not sure how Windows works, but, at a basic level, I know you can change your locale and character set on Linux. I can change to the "en_US.UTF-8" locale and it will display whatever it reads as UTF-8. From a programming standpoint, you should know what encoding is there, read it, then tell the browser what it's getting. Based on what I've seen, your data is stored in UTF-8. Are you saying all 3 operating systems are storing (encoding) the same name differently?
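One way to "know what encoding is there" is to test whether the bytes read from disk are well-formed UTF-8. The sketch below assumes nothing beyond PHP's bundled PCRE functions; the name `isValidUtf8` is made up for illustration. Note that a pure-ASCII name passes the check in either encoding, so this only separates UTF-8 from ISO-8859-1 when accented characters are present.

```php
<?php
// Returns true when $bytes is well-formed UTF-8.
// With the /u modifier PCRE validates the subject first; preg_match()
// returns false (not 0) when the subject is not valid UTF-8.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$utf8Name   = "Bj\xC3\xB6rk"; // o-with-diaeresis as two UTF-8 bytes
$latin1Name = "Bj\xF6rk";     // the same letter as one ISO-8859-1 byte

var_dump(isValidUtf8($utf8Name));   // bool(true)
var_dump(isValidUtf8($latin1Name)); // bool(false)
```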
mafro Posted October 23, 2007 Author
Quoting my earlier post: "Well, Debian and Windows work as standard, reading filenames off the disk and displaying them in iso-8859-1. No conversion required."
No, I'm saying that both Windows and Debian encode filenames in iso-8859-1. OSX is the oddity, due to the decomposed chars issue.
mafro Posted October 23, 2007 Author
For reference: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
effigy Posted October 23, 2007
Quoting mafro: "Windows... encode filenames in iso-8859-1"
Quoting the linked article: "Although not part of the standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8."
In other words, the file is encoded in UTF-8. U+FEFF breaks down to EF BB BF in UTF-8, and these bytes just happen to exist in ISO-8859-1 because they don't exceed FF. If an application does not properly read the byte sequence or does not support UTF-8, then the fallback is ISO-8859-1, which will render ï»¿. This has nothing to do with the file name, but the actual contents. Which version of Windows are you using?
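For file contents (not file names), the EF BB BF prefix described above can be checked and removed before further processing. This is a minimal sketch; `UTF8_BOM` and `stripUtf8Bom` are illustrative names, not part of any library.

```php
<?php
// U+FEFF encoded in UTF-8.
const UTF8_BOM = "\xEF\xBB\xBF";

// Removes a leading UTF-8 byte order mark, if there is one.
function stripUtf8Bom(string $contents): string
{
    if (strncmp($contents, UTF8_BOM, 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

echo stripUtf8Bom(UTF8_BOM . "hello"), "\n"; // hello
```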
mafro Posted October 24, 2007 Author
I'm on XP SP2 here. What I was really getting at with that reference was the bit on OSX: it's the best and most precise description I've found of what started this whole thread. I'm gonna do a little testing on all platforms now, where I try to pack() some UTF-8 bytes in PHP. I'll post my results.
effigy Posted October 24, 2007
You could also look for combining marks to process:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<?php
    $tests = array(
        ### UTF-8 composed.
        'Bj' . pack('C*', 0xC3, 0xB6) . 'rk',
        ### UTF-8 decomposed.
        'Bjo' . pack('C*', 0xCC, 0x88) . 'rk',
    );
    foreach ($tests as $test) {
        ### \p{M} matches any Unicode combining mark; /u reads the subject as UTF-8.
        echo "$test => ", preg_match('/\p{M}/u', $test) ? 'Has Mark' : 'Does not have Mark';
        echo '<br>';
    }
?>

For me this outputs:

Björk => Does not have Mark
Björk => Has Mark
mafro Posted October 25, 2007 Author
OK, I think I've got this solved now. The attached files show results from OSX and Windows. The interesting thing here is that you can see the Windows one has rendered the diaeresis over the preceding character in the second example. If you look closely you can see the difference in how it renders a precomposed versus a decomposed UTF-8 o-with-diaeresis. Based on everything I've learnt I wrote a final script to test all this, which reads file names from the disk and then converts them to UTF-8. For OSX it also normalises decomposed Unicode into precomposed, using the PEAR I18N_UnicodeNormalizer lib. After that I do a quick DB test to insert and retrieve the data from a UTF-8 field in MySQL. This script works as expected on all platforms. Case solved? I've included my script below for anyone else who reads this thread and might want a code example to play with. Cheers for all the input, effigy. I think with the tools/knowledge I've picked up I'll be able to solve these kinds of problems myself in the future! Happy days.
mafro

<?php
//config creates my DB connection
include_once("config.php");

//set paths here for each different OS
if(strtolower(substr(PHP_OS,0,3)) == 'win') {
    $root = "e:\\www\\mp3\\ptest\\";
}else if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
    $root = "/Users/mafro/Sites/mp3/ptest/";
}else{
    $root = "/var/www/mp3/ptest/";
}

header("Content-Type: text/html; charset=utf-8");
echo "Content-Type: text/html; charset=utf-8<br/>";
echo PHP_OS.'<br/><br/>';

//read file names from disk and take the first one that isn't a dot file
$files = scandir($root);
foreach($files as $file) {
    if(substr($file,0,1) != ".") {
        $test = $file;
        break;
    }
}
echo $test."<br/><br/>";

//convert filenames to utf-8
if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
    //normalise and convert OSX
    $norm = new I18N_UnicodeNormalizer();
    $conv = $norm->toNFC($test, 'UTF-8');
}else{
    //just convert to utf-8
    $conv = mb_convert_encoding($test, "UTF-8", "ISO-8859-1");
}
echo $conv."<br/><br/>";

//database test - insert record into UTF-8 field in DB and then attempt to retrieve it
$db =& Registry::GetDBConnection();

$sql = "delete from library where hash = 'test'";
$db->query($sql);

$sql = "insert into library (hash, filename) values ('test', '".$conv."')";
$db->query($sql);

$sql = "select filename from library where hash = 'test'";
$res = $db->query($sql);
$row = $res->fetchRow();
echo $row['filename']."<br/><br/>";
?>

[attachment deleted by admin]
effigy Posted October 25, 2007
Sounds good. Even though you haven't encountered combining marks in Linux or Windows yet, they can still be used. The better solution may be the mark-searching route, rather than an OS-specific one. That is, if you detect a mark, send it through the Unicode Normalizer.
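The mark-searching route suggested above could be sketched roughly as follows. The helper `normalizeIfMarked` is hypothetical; it takes the normalizer as a callable so that it does not assume which normalizer library is installed.

```php
<?php
// Runs $normalize on $name only when $name contains a combining mark.
// $normalize is any callable that takes a UTF-8 string and returns one.
function normalizeIfMarked(string $name, callable $normalize): string
{
    // \p{M} matches any Unicode combining mark; /u reads the subject as UTF-8.
    if (preg_match('/\p{M}/u', $name) === 1) {
        return $normalize($name);
    }
    return $name;
}

// Possible wiring with the PEAR class used earlier in the thread (assumes it is loaded):
// $norm  = new I18N_UnicodeNormalizer();
// $clean = normalizeIfMarked($file, function ($s) use ($norm) {
//     return $norm->toNFC($s, 'UTF-8');
// });
```

Because the check does not look at PHP_OS, a decomposed name arriving from Linux or Windows would be handled the same way as one from OSX.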
mafro Posted October 25, 2007 Author
Good point. Is there a defined range for combining marks? If there is then mark-detecting would be easy. I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it.

To summarize for future readers:
* filenames are stored on the disk in a variety of character encodings. Windows is a latin1; OSX hands back UTF-8, but with all chars that have combining marks stored in their Unicode 'decomposed' form. Read the Mac OSX section here: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
* the sensible option is to encode everything as UTF-8, and normalize decomposed characters where necessary.

Resources (2 from effigy's sig!):
http://www.joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/Precomposed_character
http://czyborra.com/utf/#UTF-8
http://www.phpwact.org/php/i18n/charsets
http://www.eki.ee/letter/
effigy Posted October 25, 2007
Quoting mafro: "Windows is a latin1"
This isn't true for all versions. Quoting the NTFS article: "NTFS allows any sequence of short (16-bit) values for name encoding (file names, stream names, index names, etc.). This means UTF-16 codepoints are supported," Source: NTFS.

Quoting mafro: "Is there a defined range for combining marks?"
Yes:
Combining Diacritical Marks (0300-036F)
Combining Diacritical Marks for Symbols (20D0-20FF)
Combining Half Marks (FE20-FE2F)
These should all be caught by the regular expression example I posted.

Quoting mafro: "I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it."
Correct. This might be worth testing/benchmarking: the regex vs. always normalizing.
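If only the three blocks listed above are of interest, they can be spelled out as explicit code point ranges instead of using \p{M}. This is a sketch with a made-up function name; \p{M} is the broader test, since it also matches combining marks that live outside these three blocks.

```php
<?php
// True when $utf8 contains a code point from one of the three listed blocks:
// U+0300-U+036F, U+20D0-U+20FF or U+FE20-U+FE2F.
function hasListedCombiningMark(string $utf8): bool
{
    $pattern = '/[\x{0300}-\x{036F}\x{20D0}-\x{20FF}\x{FE20}-\x{FE2F}]/u';
    return preg_match($pattern, $utf8) === 1;
}

var_dump(hasListedCombiningMark("Bjo\xCC\x88rk")); // bool(true): contains U+0308
var_dump(hasListedCombiningMark("Bj\xC3\xB6rk"));  // bool(false): precomposed U+00F6
```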