
Unicode Disk Reading


mafro


Just been playing with I18N_UnicodeNormalizer. If I do a toNFC() with a charset of UTF-8 then I get the correct output.

 

After reading from the disk:

0000000    B  j  o 314 210  r  k  \n                               

            6a42    cc6f    7288    0a6b                               

0000010

 

After using the I18N_UnicodeNormalizer class:

0000000    B  j 303 266  r  k  \n                                   

            6a42    b6c3    6b72    000a                               

0000007
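
A minimal sketch of the toNFC() call used above, assuming the PEAR package is installed and reachable via the standard I18N/UnicodeNormalizer.php include path:

<?php
// Sketch only - assumes the PEAR I18N_UnicodeNormalizer package is on the include path.
require_once 'I18N/UnicodeNormalizer.php';

// Decomposed "Björk" as read from disk: 'o' (0x6F) followed by
// U+0308 COMBINING DIAERESIS, which is 0xCC 0x88 (octal 314 210) in UTF-8.
$decomposed = "Bjo\xCC\x88rk";

$norm = new I18N_UnicodeNormalizer();
$composed = $norm->toNFC($decomposed, 'UTF-8');

// NFC recomposes o + U+0308 into the single code point U+00F6 (0xC3 0xB6 in UTF-8).
echo bin2hex($composed); // 426ac3b6726b
?>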

 

So there's a solution. Nice. If I set my encoding to ISO-8859-1, however, I get the same output as the input - it doesn't convert the o+314+210 into a 366. Can't have it all, though, I suppose?!


Actually, now that I think about it, I had pre tags in the example. Go to Tools->Options->Content->Advanced and see what's listed under "Monospace." You may want to try removing the pre tags to see if the results differ.

 

It looks the same without the pre tags. And checking the monospace font was what I had done... Courier it is.


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1. No conversion required. So it would be nice if I could have a minimal conversion routine for OSX, and then just use iso-8859-1. There's potential for these filenames (paths) being stored in a database, so a uniform charset across the board is good.

 

What do you think? Just use UTF-8 on all platforms? To be honest, it'd be nice to support OSX, but I'm not going to create loads of extra work for myself. I'm just glad we've found a workaround for future reference.


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1.

 

I'm not sure how Windows works, but, at a basic level, I know you can change your locale and character set on Linux. I can change to the "en_US.UTF-8" locale and it will display whatever it reads as UTF-8. From a programming standpoint, you should know what encoding is there, read it, then tell the browser what it's getting.
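
For instance, a one-line way to tell the browser the page is UTF-8 (the same header the final script later in this thread ends up using):

<?php
// Declare the response encoding so the browser interprets the bytes as UTF-8.
header('Content-Type: text/html; charset=utf-8');
?>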

 

Based on what I've seen, your data is stored in UTF-8. Are you saying all 3 operating systems are storing (encoding) the same name differently?


Well Debian and Windows work as standard reading filenames off the disk and displaying them in iso-8859-1. No conversion required.

 

No, I'm saying that both Windows and Debian encode filenames in iso-8859-1. OSX is the oddity, due to the decomposed-chars issue.


Windows... encode filenames in iso-8859-1

 

Although not part of the standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.

 

In other words, the file is encoded in UTF-8. U+FEFF breaks down to EF BB BF in UTF-8, and those bytes just happen to exist in ISO-8859-1 because they don't exceed FF. If an application does not properly read the byte sequence or does not support UTF-8, then the fallback is ISO-8859-1, which will render ï»¿.

 

This has nothing to do with the file name, but the actual contents.
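
As an aside, a small illustrative sketch (not from the thread; the file name is hypothetical) of detecting and stripping that BOM from a file's contents in PHP:

<?php
// Detect and strip a UTF-8 BOM (0xEF 0xBB 0xBF) from a file's contents.
$data = file_get_contents('example.txt'); // hypothetical file
if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
	$data = substr($data, 3); // drop the BOM before further processing
}
echo $data;
?>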

 

Which version of Windows are you using?

 

 


I'm on XP SP2 here. What I was really getting at with that reference was the bit on OSX - the best and most precise description I've found of what started this whole thread. I'm gonna do a little testing on all platforms now, where I try to pack() some UTF-8 bytes in PHP. I'll post my results.


You could also look for combining marks to process:

 

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<?php
$tests = array(
	### UTF-8 composed.
	'Bj' . pack('c*', 0xC3, 0xB6) . 'rk',
	### UTF-8 decomposed.
	'Bjo' . pack('c*', 0xCC, 0x88) . 'rk',
);
foreach ($tests as $test) {
	echo "$test => ", preg_match('/\p{M}/u', $test) ? 'Has Mark' : 'Does not have Mark' ;
	echo '<br>';
}
?>

 

For me this outputs:

 

Björk => Does not have Mark

Björk => Has Mark


OK, I think I've got this solved now. The attached files show results from OSX and Windows - the interesting thing is that the Windows one has rendered the diaeresis over the preceding character in the second example. If you look closely you can see the difference in how it renders a precomposed versus a decomposed UTF-8 o-with-diaeresis.

 

Based on everything I've learnt, I wrote a final script to test all this, which reads filenames from the disk and then converts them to UTF-8. For OSX it also normalises decomposed Unicode into precomposed form, using the PEAR I18N_UnicodeNormalizer lib. After that I do a quick DB test to insert and retrieve the data from a UTF-8 field in MySQL.

 

This script works as expected on all platforms. Case solved?

 

I included my script below for anyone else who reads this thread and might want a code example to play with.

 

Cheers for all the input, effigy. I think with the tools/knowledge I've picked up I'll be able to solve this kind of problem myself in the future! Happy days.

mafro

 

 

<?php

//config creates my DB connection
include_once("config.php");

//PEAR normalizer, used for the OSX branch below
//(assumes the standard PEAR include path)
require_once("I18N/UnicodeNormalizer.php");

//set paths here for each different OS
if(strtolower(substr(PHP_OS,0,3)) == 'win') {
	$root = "e:\\www\\mp3\\ptest\\";
}else if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
	$root = "/Users/mafro/Sites/mp3/ptest/";
}else{
	$root = "/var/www/mp3/ptest/";
}

header("Content-Type: text/html; charset=utf-8");
echo "Content-Type: text/html; charset=utf-8<br/>";
echo PHP_OS.'<br/><br/>';

//read the first real filename from disk (skip dotfiles)
$files = scandir($root);
foreach($files as $file) {
	if(substr($file,0,1) != ".") {
		$test = $file;
		break;
	}
}

echo $test."<br/><br/>";

//convert the filename to utf-8
if(strtolower(substr(PHP_OS,0,6)) == 'darwin') {
	//OSX: already utf-8, but decomposed - normalise to NFC
	$norm = new I18N_UnicodeNormalizer();
	$conv = $norm->toNFC($test, 'UTF-8');
}else{
	//Windows/Debian: just convert iso-8859-1 to utf-8
	$conv = mb_convert_encoding($test, "UTF-8", "ISO-8859-1");
}

echo $conv."<br/><br/>";


//database test - insert record into UTF-8 field in DB and then attempt to retrieve it
//(the DB connection itself should also be using the utf8 charset)
$db =& Registry::GetDBConnection();

$sql = "delete from library where hash = 'test'";
$db->query($sql);

$sql = "insert into library (hash, filename) values ('test', '".$conv."')";
$db->query($sql);

$sql = "select filename from library where hash = 'test'";
$res = $db->query($sql);
$row = $res->fetchRow();
echo $row['filename']."<br/><br/>";

?>

 

[attachment deleted by admin]


Sounds good.

 

Even though you haven't encountered combining marks in Linux or Windows yet, they can still be used. The better solution may be the mark-searching route, rather than an OS-specific one. That is, if you detect a mark, send it through the Unicode Normalizer.
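
A sketch of that approach, assuming the same PEAR normalizer used earlier in the thread; the helper name is just for illustration:

<?php
require_once 'I18N/UnicodeNormalizer.php';

// Hypothetical helper: only run the normalizer when a combining mark is present.
function normalize_filename($name)
{
	if (preg_match('/\p{M}/u', $name)) {
		$norm = new I18N_UnicodeNormalizer();
		$name = $norm->toNFC($name, 'UTF-8');
	}
	return $name;
}

echo normalize_filename("Bjo\xCC\x88rk"); // prints the precomposed "Björk"
?>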


Good point. Is there a defined range for combining marks? If there is, then mark-detecting would be easy. I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it.

 

To summarize for future readers:

* filenames are stored on the disk in a variety of character encodings. Windows is latin1; OSX is MacRoman, which also encodes all chars that have combining marks into their Unicode 'decomposed' form.

Read the Mac OSX section here:

http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations

 

* the sensible option is to encode everything as utf-8, and normalize decomposed characters where necessary.

 

Resources (2 from effigy's sig!):

http://www.joelonsoftware.com/articles/Unicode.html

http://en.wikipedia.org/wiki/Precomposed_character

http://czyborra.com/utf/#UTF-8

http://www.phpwact.org/php/i18n/charsets

http://www.eki.ee/letter/

 


Windows is latin1

 

This isn't true for all versions:

 

NTFS allows any sequence of short (16-bit) values for name encoding (file names, stream names, index names, etc.). This means UTF-16 codepoints are supported...

 

Source: NTFS.

 

Is there a defined range for combining marks?

 

Yes.

Combining Diacritical Marks (0300-036F)

Combining Diacritical Marks for Symbols (20D0-20FF)

Combining Half Marks (FE20-FE2F)

These should all be caught by the regular expression example I posted.
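
For example, an explicit-range pattern equivalent to those three blocks (as opposed to the broader \p{M} property) would be:

<?php
// Match only the three combining-mark blocks listed above.
$pattern = '/[\x{0300}-\x{036F}\x{20D0}-\x{20FF}\x{FE20}-\x{FE2F}]/u';

var_dump(preg_match($pattern, "Bjo\xCC\x88rk")); // int(1) - decomposed
var_dump(preg_match($pattern, "Bj\xC3\xB6rk"));  // int(0) - precomposed
?>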

 

I suppose if the normalizer is intelligent enough then it can probably 'not normalize' strings that don't need it.

 

Correct. This might be worth testing/benchmarking--the regex vs. always normalizing.
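
A rough benchmarking sketch along those lines, assuming the PEAR normalizer; the loop count and sample string are arbitrary:

<?php
require_once 'I18N/UnicodeNormalizer.php';

$norm  = new I18N_UnicodeNormalizer();
$name  = "Bjo\xCC\x88rk"; // decomposed sample
$loops = 10000;

// Approach 1: regex guard - only normalise when a mark is found.
$start = microtime(true);
for ($i = 0; $i < $loops; $i++) {
	if (preg_match('/\p{M}/u', $name)) {
		$norm->toNFC($name, 'UTF-8');
	}
}
echo 'regex guard: ', (microtime(true) - $start), " s<br/>";

// Approach 2: always normalise.
$start = microtime(true);
for ($i = 0; $i < $loops; $i++) {
	$norm->toNFC($name, 'UTF-8');
}
echo 'always normalise: ', (microtime(true) - $start), " s<br/>";
?>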

