mafro

Members
  • Posts

    23
  • Joined

  • Last visited

    Never

Everything posted by mafro

  1. Good point. Is there a defined range for combining marks? If there is, then detecting marks would be easy. I suppose if the normalizer is intelligent enough, it can probably 'not normalize' strings that don't need it.

     To summarize for future readers:

     * Filenames are stored on disk in a variety of character encodings. On Windows, PHP reads them as latin1. OSX hands them back as UTF-8 bytes, with every char that carries a combining mark stored in its Unicode 'decomposed' form. Read the Mac OSX section here: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
     * The sensible option is to encode everything as UTF-8, and normalize decomposed characters where necessary.

     Resources (2 from effigy's sig!):
     http://www.joelonsoftware.com/articles/Unicode.html
     http://en.wikipedia.org/wiki/Precomposed_character
     http://czyborra.com/utf/#UTF-8
     http://www.phpwact.org/php/i18n/charsets
     http://www.eki.ee/letter/
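On the question of a defined range: the Unicode block "Combining Diacritical Marks" covers U+0300 to U+036F, which includes the U+0308 diaeresis seen in this thread. A minimal sketch of a detector follows, assuming the input is valid UTF-8. The function name `hasCombiningMarks` is made up for illustration, and other combining blocks exist outside this range, so treat it as a quick filter rather than a complete test.

```php
<?php
// Sketch only: returns true if a UTF-8 string contains a character from the
// Combining Diacritical Marks block (U+0300 to U+036F).
// hasCombiningMarks() is a hypothetical helper, not part of PHP or PEAR.
function hasCombiningMarks($utf8)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[\x{0300}-\x{036F}]/u', $utf8) === 1;
}

$decomposed  = "Bjo\xCC\x88rk"; // 'o' followed by U+0308 (bytes cc 88)
$precomposed = "Bj\xC3\xB6rk";  // single U+00F6 (bytes c3 b6)

var_dump(hasCombiningMarks($decomposed));
var_dump(hasCombiningMarks($precomposed));
```

A check like this could gate the call to whichever normalizer is in use, so that strings without combining marks are passed through untouched.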
  2. OK, I think I've got this solved now. The attached files show results from OSX and Windows. The interesting thing here is that the Windows one has rendered the diaeresis over the preceding character in the second example. If you look closely you can see the difference in how it renders a precomposed versus a decomposed UTF-8 o-with-diaeresis.

     Based on everything I've learnt I wrote a final script to test all this, which reads files from the disk and then converts them to UTF-8. For OSX it also normalises decomposed unicode into precomposed, using the PEAR I18N_UnicodeNormalizer lib. After that I do a quick DB test to insert and retrieve the data from a UTF-8 field in MySQL. This script works as expected on all platforms. Case solved?

     I included my script below for anyone else who reads this thread and might want a code example to play with. Cheers for all the input effigy. I think with the tools/knowledge I've picked up I'll be able to solve these kinds of problems myself in the future! Happy days.

     mafro

         <?php
         // config creates my DB connection
         include_once("config.php");
         // PEAR normaliser (harmless if config.php already loads it)
         require_once 'I18N/UnicodeNormalizer.php';

         // set paths here for each different OS
         if (strtolower(substr(PHP_OS, 0, 3)) == 'win') {
             $root = "e:\\www\\mp3\\ptest\\";
         } else if (strtolower(substr(PHP_OS, 0, 6)) == 'darwin') {
             $root = "/Users/mafro/Sites/mp3/ptest/";
         } else {
             $root = "/var/www/mp3/ptest/";
         }

         header("Content-Type: text/html; charset=utf-8");
         echo "Content-Type: text/html; charset=utf-8<br/>";
         echo PHP_OS.'<br/><br/>';

         // read file names from disk and take the first non-hidden entry
         $test = '';
         $files = scandir($root);
         foreach ($files as $file) {
             if (substr($file, 0, 1) != ".") {
                 $test = $file;
                 break;
             }
         }
         echo $test."<br/><br/>";

         // convert filenames to utf-8
         if (strtolower(substr(PHP_OS, 0, 6)) == 'darwin') {
             // normalise and convert OSX
             $norm = new I18N_UnicodeNormalizer();
             $conv = $norm->toNFC($test, 'UTF-8');
         } else {
             // just convert to utf-8
             $conv = mb_convert_encoding($test, "UTF-8", "ISO-8859-1");
         }
         echo $conv."<br/><br/>";

         // database test - insert record into UTF-8 field in DB and then attempt to retrieve it
         // note: $conv goes into the SQL unescaped, which is fine for this test only
         $db =& Registry::GetDBConnection();

         $sql = "delete from library where hash = 'test'";
         $db->query($sql);

         $sql = "insert into library (hash, filename) values ('test', '".$conv."')";
         $db->query($sql);

         $sql = "select filename from library where hash = 'test'";
         $res = $db->query($sql);
         $row = $res->fetchRow();
         echo $row['filename']."<br/><br/>";
         ?>

     [attachment deleted by admin]
  3. I'm on XP SP2 here. What I was really getting at with that reference was the bit on OSX: it's the best and most precise description I've found of what started this whole thread. I'm gonna do a little testing on all platforms now, where I try to pack() some UTF-8 bytes in PHP. I'll post my results.
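For anyone wanting to repeat the pack() experiment, here is a minimal sketch. It assumes the byte values quoted in the octal dumps elsewhere in this thread (cc 88 for the combining diaeresis, c3 b6 for the precomposed ö). The variable names are only illustrative.

```php
<?php
// Build both spellings of the test string byte by byte.
// 'C*' packs every argument as one unsigned char.
$decomposed  = pack('C*', 0x42, 0x6A, 0x6F, 0xCC, 0x88, 0x72, 0x6B); // B j o U+0308 r k
$precomposed = pack('C*', 0x42, 0x6A, 0xC3, 0xB6, 0x72, 0x6B);       // B j U+00F6 r k

// Dump the bytes as hex so the two forms can be compared on any platform.
echo bin2hex($decomposed), "\n";
echo bin2hex($precomposed), "\n";
```

Sending both strings to the browser with a UTF-8 Content-Type header should show whether a given platform's font rendering composes the mark onto the preceding letter.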
  4. For reference: http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
  5. No, I'm saying that both Windows and Debian encode filenames in iso-8859-1. OSX is the oddity, due to the decomposed chars issue.
  6. Well, Debian and Windows work as standard, reading filenames off the disk and displaying them in iso-8859-1. No conversion required. So it would be nice if I could have a minimal conversion routine for OSX, and then just use iso-8859-1. There's potential for these filenames (paths) being stored in a database, so a uniform charset across the board is good. What d'you think? Just use UTF-8 on all platforms? To be honest, it'd be nice to support OSX but I'm not going to create myself loads of extra work. I'm just glad we've found a workaround for future reference.
  7. It looks the same without the pre tags. And checking the monospace font was what I had done. Courier it is.
  8. Just been playing with I18N_UnicodeNormalizer. If I do a toNFC() with a charset of UTF-8 then I get the correct output.

     After reading from the disk:

         0000000   B   j   o 314 210   r   k  \n
                 6a42 cc6f 7288 0a6b
         0000010

     After using the I18N_UnicodeNormalizer class:

         0000000   B   j 303 266   r   k  \n
                 6a42 b6c3 6b72 000a
         0000007

     So there's a solution. Nice. If I set my encoding to ISO-8859-1, however, I'm getting the same output as the input: it doesn't convert the o+314+210 into a 366. Can't have it all though, I suppose?!
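One possible reading of the ISO-8859-1 result: if the normalizer is told the input is ISO-8859-1, the bytes 314 210 are presumably taken as two separate Latin-1 characters, so there is no combining mark for it to compose. The single byte 366 should be reachable in two steps instead: normalise as UTF-8 first, then convert the result. A minimal sketch of the second step, assuming the mbstring extension is loaded and starting from a string that is already in NFC (the normalisation call itself is not repeated here):

```php
<?php
// Assumed input: the NFC form produced by the normaliser (B j c3 b6 r k).
$nfc = "Bj\xC3\xB6rk";

// U+00F6 exists in ISO-8859-1, so the precomposed form converts to one byte.
// The decomposed form could not: U+0308 has no ISO-8859-1 equivalent.
$latin1 = mb_convert_encoding($nfc, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n"; // 0xf6 is octal 366
```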
  9. I'm not sure about fonts, but since it's rendering in fixed width I'm guessing it's using Courier, which is the setting in the Firefox options. It certainly looks like Courier.
  10. Yes. I'm running Firefox 2.0.0.8 on Windows 2000 Professional. Where did this not work for you? OSX? Yeh, I'm running Firefox 2.0.0.8 on OSX and it displays the question mark. That's pretty odd really, because we're packing the correct UTF-8 bytes for the diaeresis, yet the browser seems to just want to render them as separate chars. On my Windows box the same script in the same browser works as expected! Anyway, I think the mailing list post you uncovered about limewire is basically the issue here, or rather another rendition of the same problem. And I suppose the real solution may lie in using that PEAR library to normalise all filenames read from the disk to their composed form. They can then probably be displayed in iso-8859-1 encoding. I'll post some results.
  11. Right, I think I may have narrowed it all down now. Using the iso-8859-1 character set, I can get the result from scandir() to display on both Windows and Debian. This is the octal dump output, so you can see it's not UTF-8 multibyte encoded: B j 366 r k. The only issue remaining in this case is OSX support. This won't affect me since I'm only using OSX for development; the actual production version of this will run on Debian. So we're half solved at this stage! And I've learnt a lot about character encoding. Thanks for all your help effigy; if you've got any more comments I'd be interested to hear them. mafro
  12. Ok, I found this link: http://people.w3.org/rishida/scripts/uniview/conversion.php This guy has javascript methods which convert between all the representations I'm interested in: hex, decimal, UTF-8. Since it's javascript, I could rewrite this in PHP (or whatever language) to help me convert strings into the correct encoding. Obviously, I'll need to know what I'm converting from! I guess I'd have to do testing across a whole bunch of characters on all platforms, to see the standard. One more thing: testing this stuff on Windows and using the standard iso-8859-1 charset, everything works fine. The result from scandir() displays perfectly in the browser, which I believe is because the filesystem uses that character set underneath.
  13. Just did some testing under Debian. Results are as expected. The following is the output from scandir(), processed with the octal dump tool, followed by the output displayed in the browser with UTF-8 character encoding.

      Debian:

          0000000   B   j 366   r   k  \n
                  6a42 72f6 0a6b
          0000006
          Bj�rk

      The hex here means this:

          42 6a f6 72 6b 0a
          B j ö r k \n

      OSX:

          0000000   B   j   o 314 210   r   k  \n
                  6a42 cc6f 7288 0a6b
          0000010
          Björk

      And the hex here means this:

          42 6a 6f cc88 72 6b 0a
          B j o <diaeresis> r k \n

      This is good news for me, because it makes sense, and I understand what's going on. So the question still stands about how to get it to display correctly. Under OSX it displays the two chars separately, and under Debian the ö is displayed as a question mark.

      Edit: one last example. When I hardcode the string into my PHP script it displays correctly. The octal dump as above is this:

          0000000   B   j 303 266   r   k  \n  \0
                  6a42 b6c3 6b72 000a
          0000007
          Björk

      And the hex means:

          42 6a c3b6 72 6b 0a 00
          B j <o-with-diaeresis> r k \n

      Progress?! I'm guessing that last one is good because the script is encoded as UTF-8 when I save it, and the display in the browser is the same.
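The step from the dump lines to "the hex here means this" can be mechanised. od -x prints 16-bit words, and on a little-endian machine the two bytes of each word appear swapped relative to their order in the file. A small sketch that undoes the swap, assuming little-endian dumps like the ones above (the helper name `odWordsToBytes` is made up for illustration):

```php
<?php
// Sketch: turn the 16-bit hex words printed by `od -x` on a little-endian
// machine back into bytes in file order. odWordsToBytes() is hypothetical.
function odWordsToBytes($words)
{
    $bytes = array();
    foreach (preg_split('/\s+/', trim($words)) as $word) {
        $word = str_pad($word, 4, '0', STR_PAD_LEFT);
        $bytes[] = substr($word, 2, 2); // low byte comes first in the file
        $bytes[] = substr($word, 0, 2); // high byte comes second
    }
    return implode(' ', $bytes);
}

echo odWordsToBytes('6a42 72f6 0a6b'), "\n";      // Debian dump
echo odWordsToBytes('6a42 cc6f 7288 0a6b'), "\n"; // OSX dump
```

When the input has an odd number of bytes, od pads the last word with a zero byte, which is where the trailing 00 in the hardcoded example comes from.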
  14.     0000000   B   j   o 314 210   r   k  \n
                  6a42 cc6f 7288 0a6b
          0000010

      I used shell_exec() to get the octal dump shown above. It's exactly the same output as running that on the command line. In this post: http://www.phpfreaks.com/forums/index.php/topic,164064.msg720106.html#msg720106 you say that it 'works' in Firefox. Do you mean you get the correctly rendered o with diaeresis? This script doesn't work at my end. Thanks again effigy.
  15. I set the locale using the sh syntax you posted, and [ls] returns the same result as before (Bjo??rk). In all my tests in Firefox the character encoding is UTF-8. I've been setting that with an HTTP header and leaving a meta tag in there for good measure. On another note, as I read file names off the disk I did run mb_detect_encoding() on the string that's returned. They all detect as ASCII apart from the Björk item, which is detected as UTF-8.
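A note on what that detection result can and cannot show: both the precomposed and the decomposed spelling are valid UTF-8, so an encoding check alone will not reveal which one came off the disk. A rough sketch, assuming the mbstring extension is loaded. `looksLike` is a hypothetical helper rather than a drop-in replacement for mb_detect_encoding().

```php
<?php
// Sketch: classify a byte string as pure ASCII, valid UTF-8, or neither.
// looksLike() is a hypothetical helper used only for illustration.
function looksLike($bytes)
{
    if (preg_match('/^[\x00-\x7F]*$/', $bytes) === 1) {
        return 'ASCII';
    }
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'other';
}

$samples = array(
    "Bjork",         // plain ASCII
    "Bjo\xCC\x88rk", // decomposed, as read on OSX
    "Bj\xC3\xB6rk",  // precomposed UTF-8
    "Bj\xF6rk",      // single byte f6, as read on Debian and Windows
);
foreach ($samples as $s) {
    echo bin2hex($s), ' => ', looksLike($s), "\n";
}
```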
  16. Here's the output from locale:

          elisha:~ mafro$ locale
          LANG=
          LC_COLLATE="C"
          LC_CTYPE="C"
          LC_MESSAGES="C"
          LC_MONETARY="C"
          LC_NUMERIC="C"
          LC_TIME="C"
          LC_ALL="C"

      I looked it up on this man page to make sense of it: http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/locale.1.html

      Is that true? I don't understand why they would do this. Check this link out, which I found via the lists.apple.com link I previously posted: http://developer.apple.com/qa/qa2001/qa1235.html

      Thanks for that resource effigy, I'm sure that will be useful at some point. I'd like to solve this the correct way however, on the display side. Which brings us onto the main point:

      Agreed, but PHP isn't doing the displaying--its output is being sent somewhere. And this is a browser, correct?

      I'm using the latest Firefox. And when I say 'displaying chars with PHP', displaying with the browser is what I really mean. You say you've done an example for testing? Could you post it for me please? The thing I'm seeing is Firefox displaying the latin o and the diaeresis as two separate chars. Surely there is some encoding Firefox must employ to decide to compose these characters together?

      Thanks for all the help
      mafro
  17. Strangely, what I see when I run a straight ls is "Bjo??rk". So the 2 question marks must represent the 314 and 210 returned by od -cx. I have no idea what these values mean. Running your script results in the original latin o, with the combining diaeresis: Björk

      I've been reading a lot about how unicode can represent these characters as either precomposed or decomposed. The original reason I was confused was that I was expecting precomposed, and it turns out OSX stores all these characters as decomposed internally. If you're interested, this chap has a good rant about the issue in the context of OSX/Java: http://lists.apple.com/archives/Java-dev/2006/Jul/msg00161.html

      Out of interest, how do you translate between 'utf8c=cc+88' and 'ucode=308'?

      The problem isn't really with PHP. The reason I'm getting the 2 decomposed chars is OSX's fault, and somehow I need to get the browser to display these 2 chars combined, not treat them as separate values. I'm going to revisit the architecture of my app and see if I can do all the filesystem work in Java, where I might be able to process these chars. Although it still seems to me that there should be a way of displaying these decomposed unicode chars with PHP?? Processing the chars into precomposed seems like the wrong approach.

      Any thoughts effigy? All your input has been much appreciated.
      mafro
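On translating 'utf8c=cc+88' into 'ucode=308': in a two-byte UTF-8 sequence the first byte has the form 110xxxxx and the second 10yyyyyy, and the code point is the x bits followed by the y bits. For cc 88 that gives (0xCC & 0x1F) << 6 = 0x300, plus 0x88 & 0x3F = 0x08, so 0x308. The sketch below spells out the same arithmetic for sequences of one to four bytes. It assumes well-formed input and does no validation, and `utf8ToCodepoint` is a made-up name.

```php
<?php
// Sketch: decode ONE UTF-8 encoded character into its code point.
// No validation is done; utf8ToCodepoint() is a hypothetical helper.
function utf8ToCodepoint($bytes)
{
    $b = array_values(unpack('C*', $bytes));
    switch (count($b)) {
        case 1: // 0xxxxxxx
            return $b[0];
        case 2: // 110xxxxx 10yyyyyy
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3: // 1110xxxx 10yyyyyy 10zzzzzz
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        case 4: // 11110xxx followed by three continuation bytes
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
    return null;
}

printf("U+%04X\n", utf8ToCodepoint("\xCC\x88")); // the combining diaeresis
printf("U+%04X\n", utf8ToCodepoint("\xC3\xB6")); // the precomposed o-with-diaeresis
```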
  18. Yeh, OSX is a modified BSD under the hood. I'm not too sure what the above really means! Some preliminary testing in Java is returning much the same result as PHP, so it seems I just need to find some way of interpreting these bytes into the correct character.
  19. Right right, I get you effigy. It's quite possible that you're right there. I'm working in OSX right now. Doing copy-paste in the OS seems to use UTF-8 for the encoding though! I think I'll write a little bit of Java to see if it can read the disk in unicode correctly. Although this is mostly an issue of understanding. The main problem now is: how do they 'magically' combine the 2 values into one?
  20. I found this in a PDF about unicode and PHP6. Can you explain how the sum works? Perhaps this can be used to 'create' the correct character in my script. From http://www.gravitonic.com/downloads/talks/php-quebec-2006/php-6-and-unicode.pdf: Thanks for your help effigy.
  21. I see the UTF-8 one. Because it was saved in UTF-8! This makes sense. The attached file is saved from the browser once my script has run. It is encoded as UTF-8, as per the content-type of the page. The first line is my hardcoded string, the second is read from the disk. mafro [attachment deleted by admin]
  22. Thanks for that one effigy, so now I understand where PHP is getting the 6f+308 chars from. The problem certainly isn't with the display of this data, but with working out why PHP won't read the data correctly. I have a PHP script saved in UTF-8 with a hardcoded 'Björk' in there. It displays fine with the correct Content-Type charset. Then I read from the disk, and this one is displayed wrong. I'm guessing that internally PHP doesn't handle strings as unicode. Using the link you provided: http://www.eki.ee/letter/chardata.cgi?ucode=f6 That is the character I expect to be read, but I'm getting it as two separate ones. I guessed this was because the char I wanted was 2 bytes wide! D'you think there is a way of converting the values I'm reading into the values I want? Thanks again.
  23. Hey all,

      I have a problem with reading from the local disk in a PHP jukebox app I've been working on for a couple of months. I'm posting on this forum in the hope that some reader has encountered this issue before and can resolve it once and for all! Searching the web only returned one answer, and that was not a desirable one: wait for PHP 6.

      This issue only applies to Linux (Debian etch) and OSX, running PHP 5.2.4. The issue does not appear on Windows.

      I'm reading a local file structure which contains mp3s and then displaying them in the browser, where the user can click each track to add it to a playlist. There's a Java mp3 player part of the app which runs on the same server to handle the audio. The problem is that directories/files which include non-ascii characters aren't read correctly by PHP.

      Here is an example. The first lines of each block may look the same, but if you copy-paste them out into a text editor you will see the difference. The second line gives the unicode decimals for each string, and the third line is built from HTML entities using the unicode values. This will enable your browser to display a representation of what PHP reads from my disk. (edit: this forum won't let me put html into my post, so the 3rd line will not render!)

          /mp3/Björk/
          66 106 246 114 107
          &#66; &#106; &#246; &#114; &#107;

      Is read as:

          /mp3/Björk/
          &#66; &#106; &#111; &#776; &#114; &#107;

      It was my understanding that PHP would directly read binary data off the disk, which I could then interpret as whichever charset I desire. I believe OSX and Debian use UTF-8 as their underlying charset. Converting this using mbstring in PHP doesn't work, and I am at a loss as to how this can be dealt with correctly. I am using UTF-8 encoding throughout my application; this problem only exists when reading from the disk!

      I can also provide some of my PHP test scripts should anyone like to attempt to solve this issue on their local machine.

      Thanks for any/all help or suggestions.
      mafro
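The decimal and entity lines in the example above can be generated rather than typed by hand. A minimal sketch, assuming the mbstring extension is loaded and the input is valid UTF-8. The helper names `codepoints` and `toEntities` are made up for illustration.

```php
<?php
// Sketch: list the Unicode code points of a UTF-8 string.
// codepoints() and toEntities() are hypothetical helpers.
function codepoints($utf8)
{
    // UCS-4BE stores every character as a 4-byte big-endian integer,
    // so unpack('N*') yields one integer per character.
    $ucs4 = mb_convert_encoding($utf8, 'UCS-4BE', 'UTF-8');
    return array_values(unpack('N*', $ucs4));
}

// Build numeric HTML entities so the browser shows exactly what was read.
function toEntities($utf8)
{
    $out = array();
    foreach (codepoints($utf8) as $cp) {
        $out[] = '&#' . $cp . ';';
    }
    return implode(' ', $out);
}

echo implode(' ', codepoints("Bj\xC3\xB6rk")), "\n";  // precomposed, as expected
echo implode(' ', codepoints("Bjo\xCC\x88rk")), "\n"; // decomposed, as read from disk
echo toEntities("Bjo\xCC\x88rk"), "\n";
```

Running this on a name returned by scandir() makes the precomposed versus decomposed difference visible even when the two strings render identically.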