Storing unicode characters in an XML file


5 replies to this topic

#1 elg2001

elg2001
  • Members
  • PipPip
  • Member
  • 11 posts

Posted 13 May 2006 - 07:10 AM

Hi,
I've done a lot of looking into this. Wikipedia sends UTF-8 text and can embed Unicode characters in a page (such as a Greek delta symbol). However, when I try to do the same thing in PHP and submit a delta symbol Δ through a <textarea> in an HTML form with method="post", it does not come out as Δ in the XML file. Basically, non-English characters show up as garbage. The code that stores the <textarea>'s content is as follows:

$body = $doc->createElement('body');
$bodytext = $doc->createTextNode(utf8_encode(str_replace('  ', '&nbsp;&nbsp;', str_replace("\n", '<br />', str_replace("\r", '<br />', str_replace("\r\n", '<br />', htmlentities(stripslashes($_POST['body']))))))));
$body->appendChild($bodytext);
$post->appendChild($body);
$doc->documentElement->insertBefore($post, $doc->documentElement->firstChild);
$doc->formatOutput = false;
$doc->save($fPath);

I'm using the utf8_encode() function because without it there, PHP throws an exception that the submitted character is not a valid XML character. The XML file's encoding is UTF-8, declared as follows:

$doc = new DOMDocument('1.0', 'UTF-8');

Can anyone steer me in the right direction?

#2 toplay

toplay
  • Staff Alumni
  • Advanced Member
  • 973 posts

Posted 13 May 2006 - 01:12 PM

The default encoding for XML is UTF-8. Have you remembered to set the UTF-8 character encoding in the output HTML?

http://www.w3.org/International/O-charset.html
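As a rough illustration of declaring the encoding on the page that contains the form, here is a minimal sketch. The helper name renderPostForm() and the target script save.php are made up for this example; the header() call, the <meta> tag and the accept-charset attribute are the standard ways to tell the browser that the page and the submitted data are UTF-8.

```php
<?php
// Hypothetical helper: builds a minimal page whose form submits UTF-8 text.
function renderPostForm(string $action): string
{
    // Escape the action URL so it is safe inside the attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>New post</title>
</head>
<body>
  <form method="post" action="{$safeAction}" accept-charset="UTF-8">
    <textarea name="body" rows="8" cols="60"></textarea>
    <input type="submit" value="Save">
  </form>
</body>
</html>
HTML;
}

// The HTTP header should agree with the <meta> tag, so send both.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderPostForm('save.php'); // 'save.php' is an assumed script name
```

If the header and the <meta> tag disagree, browsers generally trust the header, so a wrong server default can override a correct tag.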

You could temporarily force your browser to display the page in UTF-8 by going into the "View" menu, clicking something like "Character Encoding" and selecting "Unicode (UTF-8)".

Also, the client computer must have the correct fonts installed to display that particular language (especially Japanese, Chinese, Russian, etc.).

See if it displays correctly after addressing these points I've mentioned.

Use arrays in str_replace() like so (list "\r\n" before "\r" and "\n", otherwise a Windows line ending is replaced twice):
$bodytext = $doc->createTextNode(utf8_encode(str_replace(array("\r\n", "\r", "\n", '  '), array('<br />', '<br />', '<br />', '&nbsp;&nbsp;'), htmlentities(stripslashes($_POST['body'])))));
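One detail worth checking with the array form: str_replace() works through the search array in order, each replacement operating on the result of the previous one. The sketch below (the function name breaksToMarkup() is invented for illustration) shows an ordering that handles "\r\n" as a single line break.

```php
<?php
// Hypothetical helper: converts line breaks and double spaces to markup.
function breaksToMarkup(string $text): string
{
    // "\r\n" comes first; if "\n" or "\r" were handled earlier, a Windows
    // line ending would turn into two <br /> tags.
    $search  = array("\r\n", "\r", "\n", '  ');
    $replace = array('<br />', '<br />', '<br />', '&nbsp;&nbsp;');

    return str_replace($search, $replace, $text);
}

echo breaksToMarkup("one\r\ntwo\nthree  four"), "\n";

// For comparison: the same input with "\n" and "\r" listed before "\r\n".
$wrongOrder = str_replace(array("\n", "\r", "\r\n"), '<br />', "one\r\ntwo");
echo $wrongOrder, "\n";
```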


#3 elg2001


Posted 13 May 2006 - 08:05 PM

That didn't fix it, but you did help me spot a syntax error in my <meta> tag :)

I'll use a delta symbol Δ as an example. After I submit it, when I SSH into the Linux server and examine the .xml file the text is stored in, the delta symbol is stored as &amp;Icirc;Â^Ô

So my guess is that there's a problem either with transmitting the character to the server, or with how the server-side PHP script handles the delta symbol.

There must be something I can do to fix it, because wikipedia.com uses PHP and is able to transmit delta symbols just fine. What can I do?

#4 toplay


Posted 13 May 2006 - 09:09 PM

You might have to specify the charset (third argument) when calling htmlentities(), since it defaults to ISO-8859-1 and the Greek delta (Δ, U+0394) is not part of that charset. It might be getting messed up right from the start.

http://us2.php.net/manual/en/function.htmlentities.php
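To make the "messed up right from the start" idea concrete, here is a sketch that replays the original pipeline on the two UTF-8 bytes of Δ. It assumes the browser really posted UTF-8, uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode() (same conversion), and needs the dom and mbstring extensions. It appears to account for the &amp;Icirc; seen in the stored file.

```php
<?php
$delta = "\xCE\x94"; // the two UTF-8 bytes of Δ (U+0394), as a UTF-8 form would post them

// Step 1: htmlentities() told the input is ISO-8859-1 sees two separate characters.
// 0xCE is Î in ISO-8859-1 and becomes an entity; 0x94 has no entity and passes through.
$asLatin1 = htmlentities($delta, ENT_COMPAT, 'ISO-8859-1'); // "&Icirc;\x94"

// Step 2: a Latin-1 to UTF-8 conversion re-encodes the stray 0x94 byte as 0xC2 0x94.
$reencoded = mb_convert_encoding($asLatin1, 'UTF-8', 'ISO-8859-1'); // "&Icirc;\xC2\x94"

// Step 3: DOM escapes the ampersand when the text node is serialized.
$doc  = new DOMDocument('1.0', 'UTF-8');
$body = $doc->createElement('body');
$body->appendChild($doc->createTextNode($reencoded));
$doc->appendChild($body);
$xml = $doc->saveXML($body);

echo bin2hex($asLatin1), "\n";
echo bin2hex($reencoded), "\n";
echo (strpos($xml, '&amp;Icirc;') !== false) ? "contains &amp;Icirc;\n" : "no entity found\n";

// With the charset stated correctly, htmlspecialchars() leaves the bytes of Δ alone.
$kept = htmlspecialchars($delta, ENT_COMPAT, 'UTF-8');
echo bin2hex($kept), "\n";
```

This would also be consistent with the error seen without utf8_encode(): after step 1 the lone 0x94 byte is not valid UTF-8, so DOM has reason to reject it.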

You have to be very careful when looking at UTF-8 files with any utilities, and I don't know which tools you're using once you're on the server through SSH. As you know, a UTF-8 character can be stored in 1 to 4 bytes. Certain utilities don't support UTF-8, or will automatically convert from one character set to another (without you knowing it).
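A quick way to see the 1 to 4 byte range is to compare strlen(), which counts bytes, with mb_strlen(), which counts characters. The sample characters below are my own choice and are written as byte escapes so the script's file encoding does not matter; mbstring is assumed to be loaded.

```php
<?php
// Sample characters chosen to cover 1, 2, 3 and 4 byte UTF-8 sequences.
$samples = array(
    'A (U+0041)'      => "A",
    'Delta (U+0394)'  => "\xCE\x94",
    'Euro (U+20AC)'   => "\xE2\x82\xAC",
    'Emoji (U+1F600)' => "\xF0\x9F\x98\x80",
);

foreach ($samples as $label => $char) {
    printf(
        "%-16s bytes=%d chars=%d hex=%s\n",
        $label,
        strlen($char),              // byte count
        mb_strlen($char, 'UTF-8'),  // character count
        bin2hex($char)              // raw bytes, safe to print on any terminal
    );
}
```

Printing bin2hex() output instead of the raw characters also sidesteps the terminal problem described above, since hex digits are plain ASCII.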

Take input, encode to UTF-8, and save. Then read and display it (specifying the UTF-8 charset). Let that be your simplest test. Don't view or edit the file directly with any tools unless you know 100% that they support UTF-8. Watch which PHP functions you use on UTF-8 data. Use the mbstring extension (functions that start with mb_), see:

http://us2.php.net/manual/en/ref.mbstring.php
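The "simplest test" could look something like the sketch below: write known UTF-8 text into an XML file with DOMDocument, load it back, and compare bytes in PHP rather than by eye. The root element name posts and the sample text are assumptions for the example; the dom extension is required.

```php
<?php
// Sample text given as raw UTF-8 bytes so this script's own encoding does not matter.
$text = "\xCE\x94 \xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; // "Δ Привет"

// Save: build a tiny document and write it to a temporary file.
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('posts'); // hypothetical root element name
$doc->appendChild($root);
$body = $doc->createElement('body');
$body->appendChild($doc->createTextNode($text));
$root->appendChild($body);

$path = tempnam(sys_get_temp_dir(), 'utf8test');
$doc->save($path);

// Read: load the file again and pull the text back out.
$loaded = new DOMDocument();
$loaded->load($path);
$readBack = $loaded->getElementsByTagName('body')->item(0)->textContent;
unlink($path);

// Compare the bytes directly instead of trusting a terminal or editor.
echo ($readBack === $text) ? "round trip OK\n" : "round trip FAILED\n";
echo bin2hex($readBack), "\n";
```

If this passes but the real form still stores garbage, the problem is more likely in what arrives in $_POST or in the functions applied before createTextNode().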

Also, if you copy or FTP a UTF-8 file, it must be transferred in binary mode, not ASCII mode. This is another thing that can trip you up, because ASCII mode can convert the contents automatically during the transfer.

Try other characters than the delta.

http://www.macchiato.com/unicode/Unicode_transcriptions.html

http://acharya.iitm.ac.in/demos/unicode_testview.html

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


FYI - for all:

A slightly dated 2003 article (it says UTF-8 can be up to 6 bytes, but that is no longer accurate; the current definition allows at most 4):
http://www.joelonsoftware.com/articles/Unicode.html

http://icu.sourceforge.net/docs/papers/forms_of_unicode/

http://www.utf-8.com/

Delta char specified in:
http://www.unicode.org/charts/PDF/U0370.pdf

http://www.unicode.org/charts/

http://www.unicode.org/charts/symbols.html

http://www-950.ibm.com/software/globalization/icu/demo/unicode

http://www.fileformat.info/info/unicode/utf8test.htm

http://decodeunicode.org

#5 elg2001


Posted 15 May 2006 - 12:36 AM

Thanks! I've made some progress because of your help. I had to add the ENT_COMPAT and 'UTF-8' parameters when calling the htmlentities function. Now the delta character works. However, certain other characters from the "Character Map" program, such as Russian characters, do not show up correctly.

Here's the line of code in question after the changes:
$bodytext = $doc->createTextNode(str_replace(array("\r\n", "\r", "\n", '  '), array('<br />', '<br />', '<br />', '&nbsp;&nbsp;'), htmlspecialchars(stripslashes($_POST['body']), ENT_COMPAT, 'UTF-8')));

For whatever reason, if I add utf8_encode(), the characters get all messed up. I now think htmlspecialchars() is taking care of changing the encoding to UTF-8, so is doing it a second time with utf8_encode() the reason for the garbage characters?

#6 toplay


Posted 15 May 2006 - 01:17 AM

FYI: I don't know when you last read my previous post but I have updated it to include more comments and links.

Anyway, I don't think htmlentities() or htmlspecialchars() is really converting anything just because you specify UTF-8. I'm pretty sure that argument only tells the function what encoding the input string ($_POST['body']) is in. So, by specifying UTF-8, the data had better be in that encoding to begin with, or the result will be garbage (out). And that's probably why utf8_encode() doesn't work.
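If the form data is already UTF-8, a second Latin-1 to UTF-8 conversion is what produces the garbage. The sketch below shows this on the bytes of Δ, using mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() (an assumption made so the example does not depend on that function being available); mbstring is required.

```php
<?php
$delta = "\xCE\x94"; // Δ, already encoded as UTF-8

// Treating UTF-8 bytes as ISO-8859-1 and converting again encodes each byte
// separately: 0xCE becomes 0xC3 0x8E and 0x94 becomes 0xC2 0x94.
$twice = mb_convert_encoding($delta, 'UTF-8', 'ISO-8859-1');

echo bin2hex($delta), "\n"; // ce94
echo bin2hex($twice), "\n"; // c38ec294

// The damage is reversible as long as it happened exactly once.
$undone = mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8');
echo ($undone === $delta) ? "restored\n" : "not restored\n";
```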

I'd step backwards and take it one step at a time (to debug this). For instance, check what encoding is really coming from the form in $_POST['body'] using mb_detect_encoding() and display it. If the data really is in UTF-8, be careful with byte-oriented string functions. str_replace() with plain ASCII search strings like yours will not corrupt UTF-8, but for pattern-based or case-insensitive replacement use mb_ereg_replace() or mb_eregi_replace().
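A possible first debugging step, sketched under the assumption that mbstring is loaded: validate the submitted bytes with mb_check_encoding() and print them as hex. The helper name describeEncoding() is invented, and fixed sample strings stand in for $_POST['body'].

```php
<?php
// Hypothetical helper: reports whether a string is well-formed UTF-8.
function describeEncoding(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return 'valid UTF-8 (' . bin2hex($input) . ')';
    }
    return 'NOT valid UTF-8 (' . bin2hex($input) . ')';
}

// In the real script this would be $_POST['body']; fixed samples are used here.
$utf8Delta   = "\xCE\x94"; // Δ encoded as UTF-8
$latin1Icirc = "\xCE";     // a lone 0xCE byte: Î in ISO-8859-1, truncated in UTF-8

echo describeEncoding($utf8Delta), "\n";
echo describeEncoding($latin1Icirc), "\n";

// mb_detect_encoding() only guesses among the candidates you list; strict mode
// (third argument true) rejects candidates for which the bytes are invalid.
var_dump(mb_detect_encoding($utf8Delta, array('UTF-8', 'ISO-8859-1'), true));
```

Keep in mind that any valid UTF-8 string is also a valid ISO-8859-1 byte sequence, so detection can rule UTF-8 out but cannot strictly prove it.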

You can always convert encodings back and forth using mb_convert_encoding().
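As an illustration of converting back and forth, here is a small sketch that takes a Russian word encoded as Windows-1251 (my own sample, written as byte escapes) and converts it to UTF-8 and back. It assumes mbstring is loaded and that you actually know the source encoding; converting from the wrong source encoding produces garbage silently.

```php
<?php
// "Привет" as Windows-1251 bytes (one byte per Cyrillic letter).
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Arguments are: data, target encoding, source encoding.
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');
echo bin2hex($utf8), "\n"; // two bytes per letter in UTF-8

// Converting back gives the original bytes.
$back = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');
var_dump($back === $cp1251);
```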

Make sure you have the right fonts installed on the client machine you're testing on. Windows NT 4.0/2000/XP should already have standard Cyrillic (Windows-1251) fonts active for Russian. Also, make sure the browser is showing the page to you in UTF-8.

Good luck.



