Character sets revisited


erikla

I recently started a thread here titled "Special Language characters doesn't display correctly". Now I have run into a new problem with character sets that I hope you can help me resolve. I find the concept of character sets quite confusing. When you are dealing with a database and want its content displayed on a web page, you obviously want it displayed correctly, and if you are not from an English-speaking country, international characters can cause problems here and there. As I understand it, one needs to specify how the data (text) is stored in the database, how it is interpreted (collation), and how it is displayed on the web page.

 

Here is my situation:

One PHP page, show_content.php, displays the content of a guestbook. This PHP file is divided into three parts: an include of a header.html file, a middle section displaying the content of the database, and finally an include of a footer.html file that takes care of the lower part of the page. The header.html file loads the scripts for a dropdown menu written in JavaScript:

 

header.html

<html>

<head>
  <title>Guestbook</title>
  <meta http-equiv="Content-Language" content="da">
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

  <LINK REL=stylesheet TYPE="text/css" HREF="generel/pop_style.css">
  <LINK REL=stylesheet TYPE="text/css" HREF="generel/hoved.css">
</HEAD>

<BODY leftmargin=0 topmargin=0 marginwidth=0 marginheight=0 background="generel/univback.gif">

<script type="text/javascript" language="JavaScript1.2" src="generel/pop_core.js"></script>
<script type="text/javascript" language="JavaScript1.2" src="generel/pop_data.js"></script>
<script type="text/javascript" language="JavaScript1.2" src="generel/pop_events.js"></script>

<!--etc .......

The main page controlling this header:

 

show_content.php

<?php

function handle_fatal_errors() {
	if (http_response_code() == 500) {echo 'A technical error occurred.';}
}

register_shutdown_function('handle_fatal_errors');	
ini_set('display_errors', 0);

include("header.html");
include("settings.php");
header('Content-Type: text/html; charset=utf-8');

//etc ...

The footer.html file doesn't really matter here, so I omit it.

 

Now the strange thing: When I execute the show_content.php file above, the content of the database (guestbook) displays perfectly, but the international letters in the dropdown menu don't display properly. BUT if I comment out the following line:

header('Content-Type: text/html; charset=utf-8');

in the show_content.php file, it is the other way around! Then the items in the dropdown menu display properly while the content of the guestbook doesn't!

 

By the way: When I connect to the database to store items, I do it with the following code:

$db = new PDO('mysql:host='.$server.';dbname='.$database.';charset=utf8mb4', $username, $password);

If somebody has an explanation for this behavior, I will be happy to hear it! To me, all these character set issues seem rather overwhelming. Maybe one setting overrides the other, or what?

 

NB! I also read that the people behind the development of PHP temporarily gave up on Unicode support for a PHP 6.0 version ...

 

Erik


If your db is in UTF-8, why are you using the Windows character set in your meta element?

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

 

Basically, stick to one character set and use it consistently everywhere: in the db, headers, code, etc. Translating from one set to another rarely works 100% of the time, so it's best to avoid doing that if possible. Sometimes you have to convert though, like when you're retrieving data from an outside RSS feed that is encoded differently, but for your internal data, it should be consistent.


This is not a problem of conflicting declarations, because the meta element is overridden by the HTTP header (of course it's still wrong).

 

The problem is that the text in your dropdown menu actually is stored as Windows-1252 or similar, whereas the rest of your application uses UTF-8. This is an issue of your editor. You need to make it use UTF-8, and you have to convert all existing files (HTML, JavaScript, CSS) into UTF-8. Having the database use a certain encoding is one thing, but of course your source files need to use the same encoding as well.
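
To make the mismatch concrete, here is a minimal sketch, assuming the Mbstring extension is installed. The byte values are the standard Windows-1252 and UTF-8 encodings of the Danish letter "æ"; the variable names are made up for illustration and do not come from the code above.

```php
<?php
// The letter "æ" as an editor saving in Windows-1252 writes it: one byte.
$fromLegacyFile = "\xE6";

// The same letter encoded as UTF-8: two bytes.
$asUtf8 = "\xC3\xA6";

// A lone 0xE6 byte is not valid UTF-8, so a browser that was told
// "charset=utf-8" cannot decode it as the intended letter.
var_dump(mb_check_encoding($fromLegacyFile, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($asUtf8, 'UTF-8'));         // bool(true)

// Converting the legacy byte yields exactly the UTF-8 sequence.
$converted = mb_convert_encoding($fromLegacyFile, 'UTF-8', 'Windows-1252');
var_dump($converted === $asUtf8);                      // bool(true)
```

This is why the menu and the guestbook take turns being broken: the bytes of each part are only valid under one of the two declared encodings.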

 

There are several other issues:

 

First of all, the HTML markup is hopelessly outdated. I'd say this was written before the year 2000. JavaScript 1.2? That was back in 1997 when people used Netscape Navigator and Internet Explorer 4. Quite a lot has changed since then.

  • We no longer use HTML for styling. All style attributes like leftmargin, background were removed a long time ago in favor of CSS.
  • The language attribute is no longer used. Depending on which flavor of HTML you use, you need the type attribute or nothing at all (HTML5 uses JavaScript by default). And of course the era of JavaScript 1.2 is long gone.
  • You're missing the DOCTYPE declaration. This makes the document invalid and triggers the browser's compatibility mode (quirks mode). As a result, your users may see strange layout errors.
  • The meta element for the character encoding should be at the very top of the head element. Otherwise you risk that the browser doesn't find it at all.
  • It might be a good idea to switch to HTML5. For example, the meta element for the encoding now has a short form: <meta charset="utf-8">.

Anyway, the bottom line is: You need a better resource for HTML. The Mozilla Developer Network has excellent, up-to-date information. It's also strongly recommended that you run your HTML through the W3C validator to see if it's even correct. The small snippet above already gives me 5 errors.
 

 

 

Quote: "NB! I also read that the people around the development of PHP temporarily gave up on Unicode support for a PHP 6.0 version ..."

 

The PHP 6 project failed, and the whole Unicode discussion is somewhat misleading.

 

PHP is already perfectly capable of handling Unicode data (as you can see). What the PHP 6 people tried to do is improve and extend the support. For example, we currently need the Mbstring extension if we want to do string operations with Unicode strings (getting the n-th character, getting the length, getting a substring etc.). It would have been nice to have this in the PHP core. But, well, we won't.
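
As a small illustration of the difference, here is a sketch assuming the Mbstring extension is installed. The sample string is "æøå" written as explicit UTF-8 bytes, so the example does not depend on how the source file itself is saved; the variable names are only illustrative.

```php
<?php
// "æøå" as explicit UTF-8 bytes (two bytes per letter).
$text = "\xC3\xA6\xC3\xB8\xC3\xA5";

$byteCount = strlen($text);                   // core function: counts bytes
$charCount = mb_strlen($text, 'UTF-8');       // Mbstring: counts characters
$second    = mb_substr($text, 1, 1, 'UTF-8'); // the second character, "ø"

echo $byteCount, ' bytes, ', $charCount, " characters\n"; // 6 bytes, 3 characters
echo bin2hex($second), "\n";                              // c3b8
```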


Thanks for the input, CroNiX and Jacques1 for the many details. I appreciate it.

 

To be honest I know very little about the meaning of the special declarations accompanying the HTML file. I don't really know the difference between XHTML, HTML 5 and adding "transitional", etc. Until now I have just produced a webpage by opening a standard HTML file in a WYSIWYG program (Namo). I am perfectly sure, Jacques1, that you don't like that approach at all - not controlling every tiny detail. At the moment I don't have the energy and time to make a lot of changes in this regard on every page in my website, but in the future it is certainly something I should be aware of. I like your links. Would it be appropriate to change my header.html to the following (taken from one of the pages you linked to!):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Guestbook</title>  

  <LINK REL=stylesheet TYPE="text/css" HREF="generel/pop_style.css">
  <LINK REL=stylesheet TYPE="text/css" HREF="generel/hoved.css">
</HEAD>

<BODY leftmargin=0 topmargin=0 marginwidth=0 marginheight=0 background="generel/univback.gif">

<script type="text/javascript" language="JavaScript1.2" src="generel/pop_core.js"></script>
<script type="text/javascript" language="JavaScript1.2" src="generel/pop_data.js"></script>
<script type="text/javascript" language="JavaScript1.2" src="generel/pop_events.js"></script>

<!--etc .......

What I am worried about is that changing the initial HTML declaration might have the effect that old code doesn't display properly anymore. I should certainly learn more about these declarations in the future. However, I will never be able to make any bigger changes to the JavaScript files controlling the dropdown menu. I downloaded it from a webpage (Twinhelix.com) and customized it. It is simply too complicated: a very advanced dropdown menu!

 

 

Quote: "... and you have to convert all existing files (HTML, JavaScript, CSS) into UTF-8."

 

This could be possible, but how do I change these files to UTF-8?

 

Until now I have never had a problem with character sets on my web hotel, and when I upload my changes to the web server, the problems I have locally might even disappear. But for sure it is not satisfying not to know how to control it. As I mentioned in another thread my only page in English is here:

 

http://www.matematiksider.dk/enigma_eng.html

 

All of the rest is in my native language, which is Danish. The good thing is, that so far I have succeeded in transforming my old guestbook code to PDO and making other changes. I feel I am adapting to PHP/MySQL better and better. I will soon upload my changes to my web-hotel ...

 

 

Erik


There's nothing inherently wrong with WYSIWYG editors. The problem is that they tend to produce absolutely awful markup which no sane human would ever write. I don't know why that is, but it seems to be a universal rule.

 

Regarding XHTML vs. HTML: Just use plain HTML. Keep away from XHTML unless you know exactly what you're doing. It's a very special, very strict implementation of HTML and has specific requirements. Your document above has nothing to do with XHTML, so it's just incorrectly declared HTML which forces the browser into compatibility mode.

 

A basic HTML template looks like this:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>Your title</title>
        <link rel="stylesheet" href="your_style.css">
        <script src="your_script.js"></script>
    </head>
    <body>
        <h1>Some title</h1>
        <p>
            Welcome to my site!
        </p>
    </body>
</html>
 

Quote: "What I am worried about is that the initial HTML declaration changes might have the effect that old code doesn't display properly anymore."

 

It's actually the other way round: Without a DOCTYPE declaration, you're forcing the browser into compatibility mode (quirks mode), which heavily affects the layout. With a proper DOCTYPE, you get the standard mode.

 

 

 

Quote: "This could be possible, but how do I change these files to UTF-8?"

 

If you don't know how to transcode files with your editor, you can use Notepad++.
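
If you'd rather script the conversion, the following is a rough sketch, assuming the Mbstring extension is installed and that the old files really are Windows-1252 (an assumption, check this first). The helper names to_utf8 and convert_file are made up for this example. Back up the files before running anything like this.

```php
<?php
/**
 * Convert Windows-1252 bytes to UTF-8. Data that is already valid UTF-8
 * (including plain ASCII) is returned unchanged, so running the
 * conversion twice does not double-encode anything.
 */
function to_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

/** Rewrite a single file in place as UTF-8 (hypothetical helper). */
function convert_file(string $path): void
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    if (file_put_contents($path, to_utf8($bytes)) === false) {
        throw new RuntimeException("Cannot write $path");
    }
}

// Example call, commented out on purpose:
// convert_file('generel/pop_data.js');
```

The validity check is a heuristic: Windows-1252 text containing international letters is almost never valid UTF-8 by accident, but it is not a guarantee, so look at the result in the browser afterwards.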

 

 

 

Quote: "I will soon upload my changes to my web-hotel ..."

 

You mean “hoster”, not “hotel”. ;)

 

 


Thanks Jacques1 for the basic HTML template. I like to get it explicit like here! I will try it out.

 

I found a nice page about the issue of character sets here: Handling UTF-8 in JavaScript, PHP, and Non-UTF8 Databases

 

 

 

Quote: "If you don't know how to transcode files with your editor, you can use Notepad++."

 

 

I guess I found out how to watch and edit the encoding in Dreamweaver: Using the keyboard shortcut Ctrl+J. 

 

 

It seems that character sets are to website development what color management is to image processing: a mess, or at least hard to comprehend. I assume that the pain with character sets has its roots in history: the web teams didn't choose the right approach at the very beginning, so it has been a "stop-gap solution" or kludge ever since.

 

Erik
