
Unicode Disk Reading


mafro


Just been playing with I18N_UnicodeNormalizer. If I do a toNFC() with a charset of UTF-8 then I get the correct output.

 

After reading from the disk:

0000000    B  j  o 314 210  r  k  \n                               

            6a42    cc6f    7288    0a6b                               

0000010

 

After using the I18N_UnicodeNormalizer class:

0000000    B  j 303 266  r  k  \n                                   

            6a42    b6c3    6b72    000a                               

0000007
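
A minimal sketch of the toNFC() call used above, assuming the PEAR package is installed and reachable via the standard I18N/UnicodeNormalizer.php include path:

<?php
// Sketch only - assumes the PEAR I18N_UnicodeNormalizer package is on the include path.
require_once 'I18N/UnicodeNormalizer.php';

// Decomposed "Björk" as read from disk: 'o' (0x6F) followed by
// U+0308 COMBINING DIAERESIS, which is 0xCC 0x88 (octal 314 210) in UTF-8.
$decomposed = "Bjo\xCC\x88rk";

$norm = new I18N_UnicodeNormalizer();
$composed = $norm->toNFC($decomposed, 'UTF-8');

// NFC recomposes o + U+0308 into the single code point U+00F6 (0xC3 0xB6 in UTF-8).
echo bin2hex($composed); // 426ac3b6726b
?>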

 

So there's a solution. Nice. If I set my encoding to ISO-8859-1, however, I get the same output as the input - it doesn't convert the o+314+210 into a 366. Can't have it all, though, I suppose?!


Actually, now that I think about it, I had pre tags in the example. Go to Tools->Options->Content->Advanced and see what's listed under "Monospace." You may want to try removing the pre tags to see if the results differ.

 

It looks the same without the pre tags. And checking the monospace font was what I had done... Courier it is.


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1. No conversion required. So it would be nice if I could have a minimal conversion routine for OSX, and then just use iso-8859-1. There's potential for these filenames (paths) being stored in a database, so a uniform charset across the board is good.

 

What do you think? Just use UTF-8 on all platforms? To be honest, it'd be nice to support OSX, but I'm not going to create loads of extra work for myself. I'm just glad we've found a workaround for future reference.


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1.

 

I'm not sure how Windows works, but, at a basic level, I know you can change your locale and character set on Linux. I can change to the "en_US.UTF-8" locale and it will display whatever it reads as UTF-8. From a programming standpoint, you should know what encoding is there, read it, then tell the browser what it's getting.
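
For instance, a one-line way to tell the browser the page is UTF-8 (the same header the final script later in this thread ends up using):

<?php
// Declare the response encoding so the browser interprets the bytes as UTF-8.
header('Content-Type: text/html; charset=utf-8');
?>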

 

Based on what I've seen, your data is stored in UTF-8. Are you saying all 3 operating systems are storing (encoding) the same name differently?


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1. No conversion required.

 

No, I'm saying that both Windows and Debian encode filenames in iso-8859-1. OSX is the oddity, due to the decomposed-chars issue.


Windows... encode filenames in iso-8859-1

 

Although not part of the standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.

 

In other words, the file is encoded in UTF-8. U+FEFF breaks down to EF BB BF in UTF-8, and those bytes just happen to exist in ISO-8859-1 because they don't exceed FF. If an application does not properly read the byte sequence or does not support UTF-8, then the fallback is ISO-8859-1, which will render ï»¿.

 

This has nothing to do with the file name, but the actual contents.
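
As an aside, a small illustrative sketch (not from the thread; the file name is hypothetical) of detecting and stripping that BOM from a file's contents in PHP:

<?php
// Detect and strip a UTF-8 BOM (0xEF 0xBB 0xBF) from a file's contents.
$data = file_get_contents('example.txt'); // hypothetical file
if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
	$data = substr($data, 3); // drop the BOM before further processing
}
echo $data;
?>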

 

Which version of Windows are you using?

 

 


I'm on XP SP2 here. What I was really getting at with that reference was the bit on OSX - the best and most precise description I've found of what started this whole thread. I'm gonna do a little testing on all platforms now, where I try to pack() some UTF-8 bytes in PHP. I'll post my results.


You could also look for combining marks to process:

 

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<?php
$tests = array(
	### UTF-8 composed.
	'Bj' . pack('c*', 0xC3, 0xB6) . 'rk',
	### UTF-8 decomposed.
	'Bjo' . pack('c*', 0xCC, 0x88) . 'rk',
);
foreach ($tests as $test) {
	echo "$test => ", preg_match('/\p{M}/u', $test) ? 'Has Mark' : 'Does not have Mark' ;
	echo '<br>';
}
?>

 

For me this outputs:

 

Björk => Does not have Mark

Björk => Has Mark


OK, I think I've got this solved now. The attached files show results from OSX and Windows - the interesting thing is that the Windows one has rendered the diaeresis over the preceding character in the second example. If you look closely you can see the difference in how it renders a precomposed versus a decomposed UTF-8 o-with-diaeresis.

 

Based on everything I've learnt, I wrote a final script to test all this, which reads filenames from the disk and then converts them to UTF-8. For OSX it also normalises decomposed Unicode into precomposed form, using the PEAR I18N_UnicodeNormalizer lib. After that I do a quick DB test to insert and retrieve the data from a UTF-8 field in MySQL.

 

This script works as expected on all platforms. Case solved?

 

I included my script below for anyone else who reads this thread and might want a code example to play with.

 

Cheers for all the input, effigy. I think with the tools/knowledge I've picked up I'll be able to solve this kind of problem myself in the future! Happy days.

mafro

 

 

<?php

//config creates my DB connection
include_once("config.php");

//PEAR normalizer, used for the OSX branch below
//(assumes the standard PEAR include path)
require_once("I18N/UnicodeNormalizer.php");

//set paths here for each different OS
if(strtolower(substr(PHP_OS,0,3)) == 'win') {
	$root = "e:\\www\\mp3\\ptest\\";
}else if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
	$root = "/Users/mafro/Sites/mp3/ptest/";
}else{
	$root = "/var/www/mp3/ptest/";
}

header("Content-Type: text/html; charset=utf-8");
echo "Content-Type: text/html; charset=utf-8<br/>";
echo PHP_OS.'<br/><br/>';

//read the first real filename from disk (skip dotfiles)
$files = scandir($root);
foreach($files as $file) {
	if(substr($file,0,1) != ".") {
		$test = $file;
		break;
	}
}

echo $test."<br/><br/>";

//convert the filename to utf-8
if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
	//OSX: already utf-8, but decomposed - normalise to NFC
	$norm = new I18N_UnicodeNormalizer();
	$conv = $norm->toNFC($test, 'UTF-8');
}else{
	//Windows/Debian: just convert iso-8859-1 to utf-8
	$conv = mb_convert_encoding($test, "UTF-8", "ISO-8859-1");
}

echo $conv."<br/><br/>";


//database test - insert record into UTF-8 field in DB and then attempt to retrieve it
//(the DB connection itself should also be using the utf8 charset)
$db =& Registry::GetDBConnection();

$sql = "delete from library where hash = 'test'";
$db->query($sql);

$sql = "insert into library (hash, filename) values ('test', '".$conv."')";
$db->query($sql);

$sql = "select filename from library where hash = 'test'";
$res = $db->query($sql);
$row = $res->fetchRow();
echo $row['filename']."<br/><br/>";

?>

 

[attachment deleted by admin]


Sounds good.

 

Even though you haven't encountered combining marks in Linux or Windows yet, they can still be used. The better solution may be the mark-searching route, rather than an OS-specific one. That is, if you detect a mark, send it through the Unicode Normalizer.
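
A sketch of that approach, assuming the same PEAR normalizer used earlier in the thread; the helper name is just for illustration:

<?php
require_once 'I18N/UnicodeNormalizer.php';

// Hypothetical helper: only run the normalizer when a combining mark is present.
function normalize_filename($name)
{
	if (preg_match('/\p{M}/u', $name)) {
		$norm = new I18N_UnicodeNormalizer();
		$name = $norm->toNFC($name, 'UTF-8');
	}
	return $name;
}

echo normalize_filename("Bjo\xCC\x88rk"); // prints the precomposed "Björk"
?>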


Good point. Is there a defined range for combining marks? If there is, then mark-detecting would be easy. I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it.

 

To summarize for future readers:

* filenames are stored on the disk in a variety of character encodings. Windows is latin1; OSX is MacRoman, which also encodes all chars that have combining marks into their Unicode 'decomposed' form.

Read the Mac OSX section here:

http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations

 

* the sensible option is to encode everything as utf-8, and normalize decomposed characters where necessary.

 

Resources (2 from effigy's sig!):

http://www.joelonsoftware.com/articles/Unicode.html

http://en.wikipedia.org/wiki/Precomposed_character

http://czyborra.com/utf/#UTF-8

http://www.phpwact.org/php/i18n/charsets

http://www.eki.ee/letter/

 


Windows is latin1

 

This isn't true for all versions:

 

NTFS allows any sequence of short (16-bit) values for name encoding (file names, stream names, index names, etc.). This means UTF-16 codepoints are supported...

 

Source: NTFS.

 

Is there a defined range for combining marks?

 

Yes.

Combining Diacritical Marks (0300-036F)

Combining Diacritical Marks for Symbols (20D0-20FF)

Combining Half Marks (FE20-FE2F)

These should all be caught by the regular expression example I posted.
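
For example, an explicit-range pattern equivalent to those three blocks (as opposed to the broader \p{M} property) would be:

<?php
// Match only the three combining-mark blocks listed above.
$pattern = '/[\x{0300}-\x{036F}\x{20D0}-\x{20FF}\x{FE20}-\x{FE2F}]/u';

var_dump(preg_match($pattern, "Bjo\xCC\x88rk")); // int(1) - decomposed
var_dump(preg_match($pattern, "Bj\xC3\xB6rk"));  // int(0) - precomposed
?>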

 

I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it.

 

Correct. This might be worth testing/benchmarking--the regex vs. always normalizing.
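
A rough benchmarking sketch along those lines, assuming the PEAR normalizer; the loop count and sample string are arbitrary:

<?php
require_once 'I18N/UnicodeNormalizer.php';

$norm  = new I18N_UnicodeNormalizer();
$name  = "Bjo\xCC\x88rk"; // decomposed sample
$loops = 10000;

// Approach 1: regex guard - only normalise when a mark is found.
$start = microtime(true);
for ($i = 0; $i < $loops; $i++) {
	if (preg_match('/\p{M}/u', $name)) {
		$norm->toNFC($name, 'UTF-8');
	}
}
echo 'regex guard: ', (microtime(true) - $start), " s<br/>";

// Approach 2: always normalise.
$start = microtime(true);
for ($i = 0; $i < $loops; $i++) {
	$norm->toNFC($name, 'UTF-8');
}
echo 'always normalise: ', (microtime(true) - $start), " s<br/>";
?>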

