
Unicode Disk Reading


mafro


Hey all,

 

I have a problem with reading from the local disk in a PHP jukebox app I've been working on for a couple of months. I'm posting on this forum in the hope that some reader has encountered this issue before and can resolve it once and for all! Searching the web only returned one answer, and it was not a desirable one: wait for PHP 6.

 

This issue only applies to Linux (Debian etch) and OSX, running PHP 5.2.4. The issue does not appear on Windows.

 

I'm reading a local file structure which contains MP3s and then displaying them in the browser, where the user can click each track to add it to a playlist. There's a Java MP3 player part of the app which runs on the same server to handle the audio.

 

The problem is that directories/files which include non-ASCII characters aren't read correctly by PHP. Here is an example; the first line of each block may look the same, but if you copy-paste them into a text editor you will see the difference. The second line gives the Unicode decimal values for each string, and the third line is built from HTML entities using those values. This will enable your browser to display a representation of what PHP reads from my disk. (edit: this forum won't let me put HTML into my post, so the 3rd line will not render!)

 

/mp3/Björk/

66 106 246 114 107

B j ö r k

 

Is read as:

 

/mp3/Björk/

B j o ̈ r k

 

It was my understanding that PHP would directly read binary data off the disk, which I could then interpret as whichever charset I desire. I believe OSX and Debian use UTF-8 as their underlying charset. Converting this using the mbstring functions in PHP doesn't work, and I am at a loss as to how this can be dealt with correctly.

 

I am using UTF-8 encoding throughout my application; this problem only exists when reading from the disk!

 

I can also provide some of my PHP test scripts should anyone like to attempt to solve this issue on their local machine..

 

Thanks for any/all help or suggestions.

mafro

 


What you have is not the precomposed ö, but a Latin small letter o followed by a combining diaeresis. To my knowledge combining marks are weakly supported at the moment, if at all. You mentioned that PHP did not read the data correctly, but I would think the issue is with the display. This could be related to the capabilities of the application or the fonts being used.


Thanks for that one effigy, so now I understand where PHP is getting the 6f+308 chars from.

 

The problem certainly isn't with the display of this data, but with working out why PHP won't read the data correctly. I have a PHP script saved in UTF-8 with a hardcoded 'Björk' in there. It displays fine with the correct Content-Type charset. Then I read from the disk, and this one is displayed wrong.

 

Im guessing that internally PHP doesnt handle strings with unicode.

 

Using the link you provided:

http://www.eki.ee/letter/chardata.cgi?ucode=f6

 

That is the character I expect to be read, but im getting it as two separate ones. I guessed this was because the char I wanted was 2 bytes wide!

 

Do you think there is a way of converting the values I'm reading into the values I want?

 

Thanks again.


I see the UTF-8 one. Because it was saved in UTF-8! This makes sense.

 

The attached file is saved from the browser once my script has run. It is encoded as UTF-8, as per the content-type of the page. The first line is my hardcoded string, the second is read from the disk.

 

mafro

 

[attachment deleted by admin]


I found this in a PDF about Unicode and PHP 6. Can you explain how the sum works? Perhaps this can be used to 'create' the correct character in my script.

 

From http://www.gravitonic.com/downloads/talks/php-quebec-2006/php-6-and-unicode.pdf:

 

Unicode is Generative

* Composition can create “new” characters

* Base + non-spacing (combining) character(s)

A + ° = Å

U+0041 + U+030A = U+00C5

a + ˆ + . = ậ

U+0061 + U+0302 + U+0323 = U+1EAD

a + ̢ + ̌ = ǎ̢

U+0061 + U+0322 + U+030C

 

Thanks for your help effigy.


OK, but "Björk" is not only in your file, but also on the file system, correct? How was this folder created? I'm guessing not with UTF-8...

 

Update: For example (on Unix): prompt: \ls -d Bj*rk | od -cx

 

Regarding the PDF, I'm not sure. All I know is that a base character can be followed by any number of combining marks, and somehow these are "magically" combined into one.


Right right, I get you effigy - it's quite possible that you're right there. I'm working in OSX right now. Copy-paste in the OS seems to use UTF-8 for the encoding though!

 

I think I'll write a little bit of Java to see if it can read the disk in Unicode correctly. Although this is mostly an issue of understanding..

 

The main problem now, is how do they 'magically' combine the 2 values into one?

 

U+0041 + U+030A = U+00C5

OSX has some Unix-like capabilities, right? (I'm not a Mac guy.) You should be able to do something similar to what I posted to see what's really there.

 

The main problem now, is how do they 'magically' combine the 2 values into one?

U+0041 + U+030A = U+00C5

 

It's my understanding that this is all done on the display/rendering/font side. It's also my understanding that it doesn't convert U+0041 + U+030A to U+00C5--one is not replaced with the other. All they're doing is demonstrating that characters can be precomposed or decomposed. No matter whether there's a 6f/308 or a c3/b6, PHP should read them and return them as is.
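To make the precomposed/decomposed point concrete, here is a minimal sketch that prints the raw bytes of both spellings. The helper name `hex_bytes` is made up for illustration, and the type hints assume a current PHP (7 or later) rather than the 5.2.4 mentioned in the thread.

```php
<?php
// Two UTF-8 byte sequences that are both meant to display as "ö".
$precomposed = "\xC3\xB6";   // U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
$decomposed  = "o\xCC\x88";  // U+006F followed by U+0308 COMBINING DIAERESIS

// Hypothetical helper: show a string as space-separated hex bytes.
function hex_bytes(string $s): string
{
    return trim(chunk_split(bin2hex($s), 2, ' '));
}

echo hex_bytes('Bj' . $precomposed . 'rk'), "\n"; // 42 6a c3 b6 72 6b
echo hex_bytes('Bj' . $decomposed . 'rk'), "\n";  // 42 6a 6f cc 88 72 6b

// PHP compares strings byte by byte, so the two spellings are not equal.
var_dump($precomposed === $decomposed); // bool(false)
```

Both byte sequences are valid UTF-8. PHP hands back whichever one the filesystem stored, and it makes no attempt to convert one into the other.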


Yeh OSX is a modified BSD under the hood.

elisha:/mp3 mafro$ ls -d Bj*rk | od -cx

ls: Bj*rk: No such file or directory

elisha:/mp3 mafro$ ls -d Bj* | od -cx

0000000    B  j  o 314 210  r  k  \n                               

            6a42    cc6f    7288    0a6b                               

0000010

 

Im not too sure what the above really means!

 

Some preliminary testing in Java is returning much the same result as PHP, so it seems I just need to find some way of interpreting these bytes into the correct character.


Interesting. All of the letters are themselves, except following the "o" is the combining diaeresis in UTF-8.

 

So when you run an ls you see "Björk"?

 

To demonstrate that this isn't a PHP--or in this case, browser--problem, the following works for me in Firefox:

 

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<pre>
<?php
// U+0308 COMBINING DIAERESIS as its two UTF-8 bytes; 'C*' packs unsigned chars.
$combo_diar = pack('C*', 0xCC, 0x88);
echo 'Bjo' . $combo_diar . 'rk';
?>
</pre>

 

Once you read this information, where is it being output?

 


Strangely, what I see when I run a straight ls is "Bjo??rk". So the 2 question marks must represent the 314 and 210 returned by od -cx. I have no idea what these values mean.

 

Running your script results in the original Latin o, with the combining diaeresis:

Björk

 

Ive been reading a lot about how unicode can represent these characters as either precomposed or decomposed - and the original reason I was confused was because I was expecting precomposed - it turns out OSX stores all these characters as decomposed internally.

 

If youre interested, this chap has a good rant about the issue in the context of OSX/Java:

http://lists.apple.com/archives/Java-dev/2006/Jul/msg00161.html

 

Out of interest, how do you translate between 'utf8c=cc+88' and 'ucode=308'?

 

The problem isnt really with PHP. The reason im getting the 2 decomposed chars is OSX's fault, and somehow I need to get the browser to display these 2 chars combined, not treat them as separate values. Im going to revisit the architecture of my app and see if I can do all the filesystem work in Java, where I might be able to process these chars.. Altho it still seems to me that there should be a way of displaying these decomposed unicode chars with PHP?? Processing the chars into precomposed seems like the wrong approach..

 

Any thoughts effigy?

All your input has been much appreciated.

 

mafro


Strangely, what I see when I run a straight ls is "Bjo??rk".

 

Is your locale set to UTF-8? The locale command should tell you.

 

So the 2 question marks must represent the 314 and 210 returned by od -cx. I have no idea what these values mean.

 

od performs an octal dump and the -cx switches indicate that you want to see the characters with their hex values as well. When numbers are shown in place of characters, these are actually octal values because the character is unable to be displayed. 0314 = 0xCC and 0210 = 0x88; 0xCC and 0x88 are the UTF-8 encoding for a combining diaeresis.
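As a rough sketch of that decoding in PHP: the first loop turns the octal escapes from `od -c` into hex, and the helper `od_words_to_bytes` (a made-up name) puts the `od -x` words back into file order. It assumes the dump came from a little-endian machine, which is consistent with the word `6a42` standing for the bytes "B" (42) and "j" (6a).

```php
<?php
// od -c prints bytes it cannot display as three-digit octal numbers.
$octal = ['314', '210'];

foreach ($octal as $digits) {
    $byte = octdec($digits); // 0314 -> 204, 0210 -> 136
    printf("0%s = 0x%02X\n", $digits, $byte);
}
// prints:
// 0314 = 0xCC
// 0210 = 0x88

// od -x prints 16-bit words in host byte order. On a little-endian machine
// the word "6a42" holds the bytes 42 6a in file order.
function od_words_to_bytes(array $words): string
{
    $bytes = [];
    foreach ($words as $w) {
        $bytes[] = substr($w, 2, 2); // low byte comes first in the file
        $bytes[] = substr($w, 0, 2);
    }
    return implode(' ', $bytes);
}

echo od_words_to_bytes(['6a42', 'cc6f', '7288', '0a6b']), "\n";
// 42 6a 6f cc 88 72 6b 0a
```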

 

it turns out OSX stores all these characters as decomposed internally.

 

Is that true? I don't understand why they would do this.

 

Out of interest, how do you translate between 'utf8c=cc+88' and 'ucode=308'?

 

You can reference the Unicode Data, which includes decomposition mappings if they exist. This is a heap of data, so I would pick out the characters that you're using.

 

For instance, the composed form of "ö" is F6 and its record looks like this:

 

00F6;LATIN SMALL LETTER O WITH DIAERESIS;Ll;0;L;006F 0308;;;;N;LATIN SMALL LETTER O DIAERESIS;;00D6;;00D6

 

The 6th field contains the decomposition that you're seeing: 6F (o) and 308 (the combining diaeresis, which breaks down to CC 88 in UTF-8). You can use this to transform the character within your application. Keep in mind that there are canonical mappings and compatibility mappings. Canonical means two different ways of encoding the same symbol, whereas compatibility means the characters are fundamentally similar but may differ in their usage and rendering. As I understand it, compatibility mappings are indicated with a <bracketed> tag in that field.
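The step from a code point such as 308 to the bytes CC 88 is just the UTF-8 encoding rule. Below is a small sketch that pulls the 6th field out of the record quoted above and encodes each code point by hand. `utf8_from_codepoint` is a hypothetical helper written for this post; it does no validation (surrogates and out-of-range values are not rejected).

```php
<?php
// Encode one Unicode code point as UTF-8 bytes (illustrative helper).
function utf8_from_codepoint(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);                                   // 1 byte
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))                      // 2 bytes
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))                     // 3 bytes
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))                         // 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$record = '00F6;LATIN SMALL LETTER O WITH DIAERESIS;Ll;0;L;006F 0308;;;;N;'
        . 'LATIN SMALL LETTER O DIAERESIS;;00D6;;00D6';

$fields = explode(';', $record);
$decomposition = $fields[5]; // 6th field: "006F 0308"

foreach (explode(' ', $decomposition) as $hex) {
    printf("U+%s -> %s\n", $hex, bin2hex(utf8_from_codepoint(hexdec($hex))));
}
// U+006F -> 6f
// U+0308 -> cc88

printf("U+%s -> %s\n", $fields[0], bin2hex(utf8_from_codepoint(hexdec($fields[0]))));
// U+00F6 -> c3b6
```

So 0x308 is split into its top 5 bits and low 6 bits, which are placed in the patterns 110xxxxx and 10xxxxxx to give CC and 88.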

 

I need to get the browser to display these 2 chars combined, not treat them as separate values.

 

Which browser are you using? The example I created worked in Firefox, and web browsers are one of the few areas where combining characters are supported.

 

Altho it still seems to me that there should be a way of displaying these decomposed unicode chars with PHP??

 

Agreed, but PHP isn't doing the displaying--its output is being sent somewhere. And this is a browser, correct?


Is your locale set to UTF-8? The locale command should tell you.

 

Here's the output from locale:

 

elisha:~ mafro$ locale

LANG=

LC_COLLATE="C"

LC_CTYPE="C"

LC_MESSAGES="C"

LC_MONETARY="C"

LC_NUMERIC="C"

LC_TIME="C"

LC_ALL="C"

 

I looked it up on this man page to make sense of it:

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/locale.1.html

 

it turns out OSX stores all these characters as decomposed internally.

 

Is that true? I don't understand why they would do this.

 

Check this link out - i found it via the lists.apple.com link i previously posted:

http://developer.apple.com/qa/qa2001/qa1235.html

 

You can reference the Unicode Data, which includes decomposition mappings if they exist. This is a heap of data, so I would pick out the characters that you're using.

 

Thanks for that resource effigy, I'm sure that will be useful at some point. I'd like to solve this the correct way however, on the display side.. Which brings us onto the main point:

 

Which browser are you using? The example I created worked in Firefox, and web browsers are one of the few areas where combining characters are supported.

 

Altho it still seems to me that there should be a way of displaying these decomposed unicode chars with PHP??

 

Agreed, but PHP isn't doing the displaying--its output is being sent somewhere. And this is a browser, correct?

 

Im using Firefox latest. And when I say 'displaying chars with PHP', displaying with the browser is what I really mean. You say you've done an example for testing? Could you post it for me please?

 

The thing I'm seeing is Firefox displaying the Latin o and the diaeresis as two separate chars - surely there is some encoding Firefox must employ to decide to compose these characters together?

 

Thanks for all the help

mafro


Try setting your locale to UTF-8 then running the ls.

 

csh: setenv LC_ALL en_US.UTF-8

sh: LC_ALL=en_US.UTF-8 export LC_ALL

 

This example should be similar to what you're getting and sending to the browser.

 

When you see the faulty display in Firefox, go to View -> Character Encoding. What is bulleted there?


I set the locale using the sh syntax you posted, and [ls] returns the same result as before (Bjo??rk).

 

On all my tests in Firefox the character encoding is UTF-8. Ive been setting that with a HTTP header and leaving a meta tag in there for good measure..

 

On another note, as I read file names off the disk I ran mb_detect_encoding() on each string that's returned. They all detect as ASCII apart from the Björk item, which is detected as UTF-8.


0000000    B  j  o 314 210  r  k  \n                               

            6a42    cc6f    7288    0a6b                               

0000010

 

I used shell_exec() to get the octal dump shown above. It's exactly the same output as running that on the command line.
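The same check can be done without shelling out to od. A minimal sketch, assuming a current PHP; `dump_dir_names` is a made-up helper name and `/mp3` is simply the directory from this thread.

```php
<?php
// Return a list of [name, hex bytes] pairs for the entries of a directory.
function dump_dir_names(string $dir): array
{
    if (!is_dir($dir)) {
        return []; // nothing to report for a missing directory
    }
    $entries = scandir($dir);
    if ($entries === false) {
        return [];
    }
    $out = [];
    foreach ($entries as $name) {
        if ($name === '.' || $name === '..') {
            continue;
        }
        // bin2hex shows exactly the bytes PHP received from the filesystem.
        $out[] = [$name, trim(chunk_split(bin2hex($name), 2, ' '))];
    }
    return $out;
}

// '/mp3' is the path used in the thread; adjust to your own layout.
foreach (dump_dir_names('/mp3') as [$name, $hex]) {
    echo $name, ' => ', $hex, "\n";
}
```

Going by the dumps in this thread, the OSX directory would show up as `42 6a 6f cc 88 72 6b`, the same bytes od reported.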

 

In this post:

http://www.phpfreaks.com/forums/index.php/topic,164064.msg720106.html#msg720106

 

You say that it 'works' in firefox. Do you mean you get the correctly rendered o with diaresis? This script doesnt work at my end..

 

Thanks again effigy.


Just did some testing under Debian. Results are as expected. The following is the output from scandir(), processed with the octal dump tool, followed by the output displayed in the browser with UTF-8 character encoding.

 

Debian:

0000000  B  j 366  r  k  \n

        6a42 72f6 0a6b

0000006

 

Bj�rk

 

The hex here means this:

42 6a f6 72 6b 0a

B j ö r k \n

 

OSX:

0000000    B  j  o 314 210  r  k  \n                               

            6a42    cc6f    7288    0a6b                               

0000010

 

Björk

 

And the hex here means this:

42 6a 6f cc88 72 6b 0a

B j o <diaeresis> r k \n

 

This is good news for me, because it makes sense.. And I understand what's going on. So the question still stands about how to get it to display correctly.. Under OSX it displays the two chars separately, and under Debian the ö is displayed as a question mark.

 

Edit: one last example. When I hardcode the string into my PHP script it displays correctly. The octal dump as above is this:

 

0000000  B  j 303 266  r  k  \n  \0

        6a42 b6c3 6b72 000a

0000007

 

Björk

 

And the hex means:

42 6a c3b6 72 6b 0a 00

B j <o-with-diaeresis> r k \n

 

Progress?! Im guessing that last one is good because the script is encoded as UTF-8 when I save it, and the display in the browser is the same.


Ok I found this link:

http://people.w3.org/rishida/scripts/uniview/conversion.php

 

This guy has javascript methods which convert between all the codepoints im interested in. Hex, Decimal, UTF-8. Since it's javascript, i could rewrite this in PHP (or whatever language) to help me convert strings into the correct encoding. Obviously, ill need to know what im converting from!

 

I guess id have to do testing across a whole bunch of characters on all platforms, to see the standard.

 

One more thing: testing this stuff on Windows using the standard iso-8859-1 charset, everything works fine. The result from scandir() displays perfectly in the browser, which I believe is because the filesystem uses that character set underneath.


Right I think i may have narrowed it all down now. Using the iso-8859-1 character set, I can get the result from scandir to display on both Windows and Debian.

 

This is the octal dump output, so you can see it's not UTF-8 multibyte encoded.

B  j 366  r  k

 

The only issue remaining in this case then is OSX support. This wont affect me since im only using OSX for development, the actual production version of this will run on Debian. So we're half solved at this stage! And ive learnt a lot about character encoding..

 

Thanks for all your help effigy, if you've got anymore comments id be interested to hear them.

mafro


In this post:

http://www.phpfreaks.com/forums/index.php/topic,164064.msg720106.html#msg720106

 

You say that it 'works' in firefox. Do you mean you get the correctly rendered o with diaresis? This script doesnt work at my end..

 

Yes. I'm running Firefox 2.0.0.8 on Windows 2000 Professional. Where did this not work for you? OSX?

 

under Debian the ö is displayed as a question mark.

 

This is happening because you're instructing the browser to use UTF-8, but the data is not encoded in UTF-8.
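One way to keep the page in UTF-8 and still show such names is to convert them before output. The sketch below assumes that any name which is not valid UTF-8 is ISO-8859-1, which is a guess based on the single byte F6 in the Debian dump. Both helper names are hypothetical; with the mbstring extension loaded, `mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1')` would normally replace the hand-written loop.

```php
<?php
// True when $s is a well-formed UTF-8 byte sequence.
function is_utf8(string $s): bool
{
    // With the /u modifier, preg_match returns false on invalid UTF-8.
    return preg_match('//u', $s) === 1;
}

// Convert ISO-8859-1 bytes to UTF-8: each byte is its own code point.
function latin1_to_utf8(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

$fromDisk = "Bj\xF6rk"; // the bytes seen on the Debian box
$name = is_utf8($fromDisk) ? $fromDisk : latin1_to_utf8($fromDisk);

echo bin2hex($name), "\n"; // 426ac3b6726b
```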

 

Ok I found this link:

http://people.w3.org/rishida/scripts/uniview/conversion.php

 

This guy has javascript methods which convert between all the codepoints im interested in. Hex, Decimal, UTF-8. Since it's javascript, i could rewrite this in PHP (or whatever language) to help me convert strings into the correct encoding.

 

The encodings are correct, they're just different ways of approaching the same goal. All this script does is show you the breakdown of a character, which isn't much help in your case.

 

The lingering problem is the display of the character--not the encoding--which we've touched on. What browser are you using under OSX? Some additional reading has brought about the concept of normalization. This page suggests:

 

Then, on MacOS/MacOSX *only*, force all strings returned by

File.getName() to their NFC form. For example:

    String name = file.getName();

would be followed by:

    if (CommonUtils.isAnyMac()) name = UnicodeString.NFC(name);

This change would be performed throughout the code. An alternative approach

would be to add a method in CommonUtils, and replace instead the above first

line by:

    String name = CommonUtils.getFileName(file);

where the new method would call File.getName() and use the UnicodeString.NFC

converter. This will solve most of the issues related to the MacOS behavior

(that neither MRJ 2.5 for Mac OS 8/9, nor Java2 for Mac OSX correct for

now). But there will still be interoperability issues with other systems.

 

I ran across this for PHP.
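For reference, here is what NFC normalization of a file name can look like in PHP. This sketch uses the `Normalizer` class from the intl extension, which is not part of PHP 5.2.4 and is a swapped-in alternative, not the library referred to above. `to_nfc` is a hypothetical wrapper, and it leaves the name untouched when intl is not loaded.

```php
<?php
// Bring a file name to NFC (precomposed) form when the intl extension is loaded.
function to_nfc(string $name): string
{
    if (!class_exists('Normalizer')) {
        return $name; // intl not available: leave the bytes untouched
    }
    $nfc = Normalizer::normalize($name, Normalizer::FORM_C);
    return $nfc === false ? $name : $nfc; // false signals a normalization failure
}

$fromDisk = "Bjo\xCC\x88rk"; // decomposed, as OSX returned it
echo bin2hex(to_nfc($fromDisk)), "\n"; // with intl loaded: 426ac3b6726b
```

Applied to every name returned by scandir(), this would give the same bytes as the hardcoded UTF-8 string, whichever platform the script runs on.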


In this post:

http://www.phpfreaks.com/forums/index.php/topic,164064.msg720106.html#msg720106

 

You say that it 'works' in firefox. Do you mean you get the correctly rendered o with diaresis? This script doesnt work at my end..

 

Yes. I'm running Firefox 2.0.0.8 on Windows 2000 Professional. Where did this not work for you? OSX?

 

Yeh, I'm running Firefox 2.0.0.8 on OSX and it displays the question mark. That's pretty odd really, because we're packing the correct UTF-8 bytes for the diaeresis - the browser seems to just want to render them as separate chars. On my windows box the same script in the same browser works as expected!

 

Anyway, I think the mailing list post you uncovered about limewire is basically the issue here - or rather another rendition of the same problem. And I suppose the real solution may lie in using that PEAR library to normalise all filenames read from the disk to their composed form. They can then probably be displayed in iso-8859-1 encoding.

 

Ill post some results.


...the browser seems to just want to render them as separate chars. On my windows box the same script in the same browser works as expected!

 

This may be a font issue since the application versions are the same. I'm really not sure. Do you know which font Firefox is using by default on OSX?

