Jump to content

Character encoding problem. Got strange characters in my database


Fluoresce

Recommended Posts

I submit text to my MySQL database using a form on my site. When I look at the text in my database, sometimes, there are strange characters. The strange characters are caused by quotation marks, em dashes, apostrophes and foreign letters of the alphabet. I think that this only happens when the source of the text is a Windows program. I understand that this is a character encoding issue, but I don't fully understand the subject. I've spent the last few hours researching it, but it's only confused me.

 

My site uses UTF-8 encoding:

<meta http-equiv="content-type" content="text/xml; charset=utf-8" />

The collation of my database is utf8_general_ci.

 

My form looks like this:

<form action="" method="post">
</form>

As you can see, an accept-charset="utf-8" attribute has not been specified.

 

 

Questions

 

1) I am guessing that my problem is that the Windows characters are being misinterpreted by my UTF-8 setup. Is that correct?

 

2) If so, is there a way that I can safely convert the Windows characters to UTF-8 during the submission process?

 

3) Should I also specify an accept-charset="utf-8" attribute on the form?

 

4) When I paste the Windows text directly into my database without using the form, the characters save without turning into the strange characters. But they don't render properly on my site. Can't browsers identify Windows characters?

Since you didn't say anything about the character encoding of the database connection, my guess is that this is your problem.

 

When you send data from the browser to some database, there are at least 3 encodings involved:

  • The encoding of the user input in the HTTP request.
  • The encoding of the database connection (MySQL must know how to interpret the incoming data).
  • The encoding of the MySQL table.

You've appearently covered the first and the last stage. No, you don't need an accept-charset attribute; it's enough to declare the encoding of the entire document.

 

If you've forgotten the second stage, then MySQL might misunderstand the incoming UTF-8 data (the default encoding is Latin-1) and store nonsense in your table. How to declare the encoding of the connection depends on the database interface you use. If you're still using the old mysql_* functions, then it's mysql_set_charset():

mysql_set_charset('utf8');

Do not use a SET NAMES query. While this also changes the encoding, it doesn't update the encoding information in the MySQL API and can break important functions like mysql_real_escape_string() entirely.

Thanks, Jacques1!

 

You were right.

 

I added:

mysql_set_charset("utf8", $connection);

and now the curly quotes, em dashes, etc., are submitted from my form to my database without being changed into strange characters.

 

Curly quotes, em dashes, etc., can now also be displayed on my web pages (before, they would render as small diamonds with question marks inside).

 

I used:

echo mysql_client_encoding($connection);

to check the character set of my MySQL connection. It said latin 1. Now it says utf 8.

 

I appreciate your help. ;D

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.