mafro

Members
  • Posts

    23
  • Joined

  • Last visited

    Never

Profile Information

  • Gender
    Not Telling

  1. Good point. Is there a defined range for combining marks? If there is, then detecting marks would be easy. I suppose if the normalizer is intelligent enough it can probably leave alone strings that don't need normalizing.

     To summarize for future readers:

     * Filenames come off the disk in a variety of character encodings. In my tests Windows and Debian both gave PHP latin1 (ISO-8859-1). OSX gives UTF-8, but with every character that has a combining mark stored in its Unicode 'decomposed' form. Read the Mac OSX section here: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
     * The sensible option is to encode everything as UTF-8, and normalize decomposed characters where necessary.

     Resources (2 from effigy's sig!):
     http://www.joelonsoftware.com/articles/Unicode.html
     http://en.wikipedia.org/wiki/Precomposed_character
     http://czyborra.com/utf/#UTF-8
     http://www.phpwact.org/php/i18n/charsets
     http://www.eki.ee/letter/
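On the "defined range" question: Unicode does have a dedicated block, Combining Diacritical Marks, at U+0300 to U+036F, and PCRE's `\p{Mn}` property matches nonspacing marks in general. Below is a minimal sketch of mark detection, assuming the input is valid UTF-8 and PCRE has Unicode support. The function name `has_combining_marks` is made up for illustration.

```php
<?php
// Hypothetical helper: returns true if a UTF-8 string contains any
// nonspacing combining mark (Unicode category Mn), e.g. U+0308.
function has_combining_marks(string $utf8): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which counts as "no" here.
    return preg_match('/\p{Mn}/u', $utf8) === 1;
}

$decomposed  = "Bjo\xCC\x88rk"; // o followed by U+0308 COMBINING DIAERESIS
$precomposed = "Bj\xC3\xB6rk";  // U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

var_dump(has_combining_marks($decomposed));  // bool(true)
var_dump(has_combining_marks($precomposed)); // bool(false)
```

A check like this could be used to skip normalization for names that contain no marks at all, which is most of them.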
  2. OK, I think I've got this solved now. The attached files show results from OSX and Windows. The interesting thing here is that you can see the Windows one has rendered the diaeresis over the preceding character in the second example. If you look closely you can see the difference in how it renders when comparing a precomposed to a decomposed UTF-8 o-with-diaeresis.

     Based on everything I've learnt I wrote a final script to test all this, which reads files from the disk and then converts them to UTF-8. For OSX it also normalises decomposed Unicode into precomposed, using the PEAR I18N_UnicodeNormalizer lib. After that I do a quick DB test to insert and retrieve the data from a UTF-8 field in MySQL. This script works as expected on all platforms. Case solved?

     I've included my script below for anyone else who reads this thread and might want a code example to play with.

     Cheers for all the input effigy. I think with the tools/knowledge I've picked up I'll be able to solve these kinds of problems myself in the future! Happy days.

     mafro

     <?php
     // config creates my DB connection
     include_once("config.php");
     // PEAR normaliser class (path assumes the standard PEAR file layout)
     require_once("I18N/UnicodeNormalizer.php");

     // set paths here for each different OS
     $os = strtolower(PHP_OS);
     if (substr($os, 0, 3) == 'win') {
         $root = "e:\\www\\mp3\\ptest\\";
     } else if (substr($os, 0, 6) == 'darwin') {
         $root = "/Users/mafro/Sites/mp3/ptest/";
     } else {
         $root = "/var/www/mp3/ptest/";
     }

     header("Content-Type: text/html; charset=utf-8");
     echo "Content-Type: text/html; charset=utf-8<br/>";
     echo PHP_OS . '<br/><br/>';

     // read file names from disk and take the first one that is not a dotfile
     $test = '';
     $files = scandir($root);
     foreach ($files as $file) {
         if (substr($file, 0, 1) != ".") {
             $test = $file;
             break;
         }
     }
     echo $test . "<br/><br/>";

     // convert filenames to utf-8
     if (substr($os, 0, 6) == 'darwin') {
         // OSX: normalise decomposed characters to precomposed (NFC)
         $norm = new I18N_UnicodeNormalizer();
         $conv = $norm->toNFC($test, 'UTF-8');
     } else {
         // Windows/Debian: just convert latin1 to utf-8
         $conv = mb_convert_encoding($test, "UTF-8", "ISO-8859-1");
     }
     echo $conv . "<br/><br/>";

     // database test - insert record into UTF-8 field in DB and then attempt to retrieve it
     // NOTE: $conv goes into the SQL unescaped. That is fine for this test, but a
     // filename containing a quote would break the query, so escape it in real code.
     $db = Registry::GetDBConnection();

     $sql = "delete from library where hash = 'test'";
     $db->query($sql);

     $sql = "insert into library (hash, filename) values ('test', '" . $conv . "')";
     $db->query($sql);

     $sql = "select filename from library where hash = 'test'";
     $res = $db->query($sql);
     $row = $res->fetchRow();
     echo $row['filename'] . "<br/><br/>";
     ?>

     [attachment deleted by admin]
  3. I'm on XP SP2 here. What I was really getting at with that reference was the bit on OSX: it's the best and most precise description I've found of what started this whole thread. I'm going to do a little testing on all platforms now, where I try to pack() some UTF-8 bytes in PHP. I'll post my results.
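For the pack() test, here is a minimal sketch of building both forms of the character from raw bytes. The byte values are the ones from the octal dumps in this thread (octal 303 266 is hex c3 b6, octal 314 210 is hex cc 88); the variable names are only for illustration.

```php
<?php
// Precomposed: U+00F6 is the two UTF-8 bytes C3 B6.
$precomposed = 'Bj' . pack('C*', 0xC3, 0xB6) . 'rk';

// Decomposed: plain "o" followed by U+0308, the UTF-8 bytes CC 88.
$decomposed = 'Bjo' . pack('C*', 0xCC, 0x88) . 'rk';

echo bin2hex($precomposed), "\n"; // 426ac3b6726b
echo bin2hex($decomposed), "\n";  // 426a6fcc88726b
echo strlen($precomposed), ' ', strlen($decomposed), "\n"; // 6 7
```

Both strings are valid UTF-8 and should look the same on screen, yet they differ in length and compare as unequal byte for byte.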
  4. For reference: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
  5. No, I'm saying that both Windows and Debian encode filenames in iso-8859-1. OSX is the oddity, due to the decomposed chars issue.
  6. Well, Debian and Windows work as standard: filenames read off the disk display correctly in iso-8859-1, no conversion required. So it would be nice if I could have a minimal conversion routine for OSX, and then just use iso-8859-1. There's potential for these filenames (paths) being stored in a database, so a uniform charset across the board is good. What d'you think? Just use UTF-8 on all platforms? To be honest, it'd be nice to support OSX but I'm not going to create loads of extra work for myself. I'm just glad we've found a workaround for future reference.
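If UTF-8 everywhere is the goal, the latin1 side of the conversion is mechanical: every byte from 0x80 to 0xFF becomes a two-byte UTF-8 sequence. The sketch below does that without any extension, purely for illustration; `latin1_to_utf8_sketch` is a made-up name, and in practice mb_convert_encoding() does the same job.

```php
<?php
// Illustration only: convert ISO-8859-1 bytes to UTF-8 by hand.
// Bytes below 0x80 are ASCII and are identical in both encodings.
function latin1_to_utf8_sketch(string $latin1): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $m): string {
            $byte = ord($m[0]);
            // Code points U+0080..U+00FF need two UTF-8 bytes: 110xxxxx 10xxxxxx
            return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        },
        $latin1
    );
}

echo bin2hex(latin1_to_utf8_sketch("Bj\xF6rk")), "\n"; // 426ac3b6726b
```

This only holds if the input really is latin1. Running it over a string that is already UTF-8 would double-encode it.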
  7. It looks the same without the pre tags. And checking the monospace font was what I had done. Courier it is.
  8. Just been playing with I18N_UnicodeNormalizer. If I do a toNFC() with a charset of UTF-8 then I get the correct output.

     After reading from the disk:

       0000000 B j o 314 210 r k \n
               6a42 cc6f 7288 0a6b
       0000010

     After using the I18N_UnicodeNormalizer class:

       0000000 B j 303 266 r k \n
               6a42 b6c3 6b72 000a
       0000007

     So there's a solution. Nice. If I set my encoding to ISO-8859-1, however, I get the same output as the input: it doesn't convert the o+314+210 into a 366. Can't have it all though, I suppose?!
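A likely reason the ISO-8859-1 setting changes nothing: latin1 has no combining diaeresis, so read as latin1 the bytes 6f cc 88 are just three unrelated characters with nothing to compose. To end up with a single 366 byte, the string would have to be composed as UTF-8 first and transcoded to latin1 afterwards. The sketch below shows that order of operations. Both function names are hypothetical, and the composition step is hard-coded for this one letter rather than being a real normaliser.

```php
<?php
// Illustration only: a one-entry composition table. A real normaliser
// (such as the PEAR class used in this thread) covers all of Unicode.
function compose_o_diaeresis(string $utf8): string
{
    // "o" + U+0308 (6F CC 88)  ->  U+00F6 (C3 B6)
    return str_replace("o\xCC\x88", "\xC3\xB6", $utf8);
}

// Transcode UTF-8 to ISO-8859-1 for code points U+0080..U+00FF only.
// Those are exactly the two-byte sequences starting with C2 or C3;
// anything above U+00FF is left untouched by this sketch.
function utf8_to_latin1_sketch(string $utf8): string
{
    return preg_replace_callback(
        '/[\xC2\xC3][\x80-\xBF]/',
        function (array $m): string {
            $lead  = ord($m[0][0]) & 0x03;
            $trail = ord($m[0][1]) & 0x3F;
            return chr(($lead << 6) | $trail);
        },
        $utf8
    );
}

$fromDisk = "Bjo\xCC\x88rk";                 // what OSX hands back
$nfc      = compose_o_diaeresis($fromDisk);  // step 1: compose
$latin1   = utf8_to_latin1_sketch($nfc);     // step 2: transcode

echo bin2hex($nfc), "\n";    // 426ac3b6726b
echo bin2hex($latin1), "\n"; // 426af6726b
```

Hex f6 is octal 366, the same byte Debian and Windows return for this filename.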
  9. I'm not sure about fonts, but since it's rendering in fixed width I'm guessing it's using Courier, which is the setting in the Firefox options. It certainly looks like Courier.
  10. Yes. I'm running Firefox 2.0.0.8 on Windows 2000 Professional. Where did this not work for you? OSX? Yeah, I'm running Firefox 2.0.0.8 on OSX and it displays the question mark. That's pretty odd really, because we're packing the correct UTF-8 bytes for the diaeresis - the browser seems to just want to render them as separate chars. On my Windows box the same script in the same browser works as expected! Anyway, I think the mailing list post you uncovered about limewire is basically the issue here - or rather another rendition of the same problem. And I suppose the real solution may lie in using that PEAR library to normalise all filenames read from the disk to their composed form. They can then probably be displayed in iso-8859-1 encoding. I'll post some results.
  11. Right, I think I may have narrowed it all down now. Using the iso-8859-1 character set, I can get the result from scandir() to display on both Windows and Debian. This is the octal dump output, so you can see it's not UTF-8 multibyte encoded:

       B j 366 r k

     The only issue remaining in this case then is OSX support. This won't affect me since I'm only using OSX for development; the actual production version of this will run on Debian. So we're half solved at this stage! And I've learnt a lot about character encoding. Thanks for all your help effigy, if you've got any more comments I'd be interested to hear them.

     mafro
  12. OK, I found this link: http://people.w3.org/rishida/scripts/uniview/conversion.php This guy has JavaScript methods which convert between all the code point notations I'm interested in: hex, decimal, UTF-8. Since it's JavaScript, I could rewrite this in PHP (or whatever language) to help me convert strings into the correct encoding. Obviously, I'll need to know what I'm converting from! I guess I'd have to test a whole bunch of characters on all platforms to see what each one does as standard. One more thing: testing this stuff on Windows with the standard iso-8859-1 charset, everything works fine. The result from scandir() displays perfectly in the browser, which I believe is because the filesystem uses that character set underneath.
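The core of such a conversion is small. As a sketch of what a PHP port might look like (this is not the code from the linked page, and `codepoint_to_utf8` is a name invented here), encoding one code point as UTF-8 only needs some bit shifting:

```php
<?php
// Encode a single Unicode code point as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x00F6)), "\n"; // c3b6
echo bin2hex(codepoint_to_utf8(0x0308)), "\n"; // cc88
```

The two outputs are the byte pairs that show up in the octal dumps in this thread: 303 266 for the precomposed letter and 314 210 for the combining diaeresis.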
  13. Just did some testing under Debian. Results are as expected. The following is the output from scandir() processed with the octal dump tool, followed by the output displayed in the browser with UTF-8 character encoding.

     Debian:

       0000000 B j 366 r k \n
               6a42 72f6 0a6b
       0000006

       Bj�rk

     The hex here means this:

       42 6a f6 72 6b 0a
       B  j  ö  r  k  \n

     OSX:

       0000000 B j o 314 210 r k \n
               6a42 cc6f 7288 0a6b
       0000010

       Björk

     And the hex here means this:

       42 6a 6f cc88         72 6b 0a
       B  j  o  <diaeresis>  r  k  \n

     This is good news for me, because it makes sense and I understand what's going on. So the question still stands about how to get it to display correctly. Under OSX it displays the two chars separately, and under Debian the ö is displayed as a question mark.

     Edit: one last example. When I hardcode the string into my PHP script it displays correctly. The octal dump as above is this:

       0000000 B j 303 266 r k \n \0
               6a42 b6c3 6b72 000a
       0000007

       Björk

     And the hex means:

       42 6a c3b6                72 6b 0a 00
       B  j  <o-with-diaeresis>  r  k  \n

     Progress?! I'm guessing that last one is good because the script is encoded as UTF-8 when I save it, and the display in the browser is the same.
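A note on reading the hex rows: od prints them as 16-bit words, and on a little-endian machine each word shows its two bytes swapped, so "6a42" is the bytes 42 6a. A small sketch of undoing that by hand, assuming little-endian od output like the dumps above; `od_words_to_hex_bytes` is a made-up helper name.

```php
<?php
// Turn the 16-bit words printed by `od -x` on a little-endian machine
// back into the original byte order. $byteCount comes from the final
// offset line of the dump, which od prints in octal (0000010 = 8).
function od_words_to_hex_bytes(array $words, int $byteCount): string
{
    $hex = '';
    foreach ($words as $word) {
        // "6a42" holds byte 0x42 first, then 0x6a.
        $hex .= substr($word, 2, 2) . substr($word, 0, 2);
    }
    // od pads an odd-length input with a zero byte; trim to the real length.
    return substr($hex, 0, $byteCount * 2);
}

// OSX dump (decomposed)
echo od_words_to_hex_bytes(['6a42', 'cc6f', '7288', '0a6b'], octdec('0000010')), "\n";
// 426a6fcc88726b0a

// Hardcoded string dump (precomposed)
echo od_words_to_hex_bytes(['6a42', 'b6c3', '6b72', '000a'], octdec('0000007')), "\n";
// 426ac3b6726b0a
```

The padding rule also explains the trailing 00 in the last dump: the input is 7 bytes long, so the final word is filled out with a zero byte.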
  14.
       0000000 B j o 314 210 r k \n
               6a42 cc6f 7288 0a6b
       0000010

     I used shell_exec() to get the octal dump shown above. It's exactly the same output as running it on the command line.

     In this post: http://www.phpfreaks.com/forums/index.php/topic,164064.msg720106.html#msg720106 you say that it 'works' in Firefox. Do you mean you get the correctly rendered o with diaeresis? This script doesn't work at my end.

     Thanks again effigy.
  15. I set the locale using the sh syntax you posted, and [ls] returns the same result as before (Bjo??rk). In all my tests in Firefox the character encoding is UTF-8. I've been setting that with an HTTP header and leaving a meta tag in there for good measure. On another note, as I read file names off the disk I ran mb_detect_encoding() on each string that's returned. They all detect as ASCII apart from the Björk item, which is detected as UTF-8.
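That detection result is what you would expect: a pure-ASCII name is valid in every encoding discussed here, while the OSX name contains the multibyte sequence cc 88, which is well-formed UTF-8. A rough sketch of the same classification using only PCRE follows. `classify_bytes` is a hypothetical helper, not a replacement for mb_detect_encoding().

```php
<?php
// Rough classification of a filename's bytes. This sketch cannot tell
// Latin-1 apart from other single-byte encodings; it only separates
// ASCII, well-formed UTF-8, and everything else.
function classify_bytes(string $s): string
{
    if (preg_match('/^[\x00-\x7F]*$/', $s) === 1) {
        return 'ASCII';
    }
    // With the /u modifier, preg_match() fails (returns false) on a
    // subject that is not well-formed UTF-8.
    if (preg_match('//u', $s) === 1) {
        return 'UTF-8';
    }
    return 'not UTF-8 (possibly ISO-8859-1)';
}

echo classify_bytes('Bjork'), "\n";         // ASCII
echo classify_bytes("Bjo\xCC\x88rk"), "\n"; // UTF-8
echo classify_bytes("Bj\xF6rk"), "\n";      // not UTF-8 (possibly ISO-8859-1)
```

The third case is the latin1 form of the name: the lone f6 byte is not valid UTF-8, which fits with a browser showing a question mark for it when the page is declared as UTF-8.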