
MySQL Unicode Problems


jamina1


Hi guys -

 

I have a problem with our website and various encodings. We have a Chinese and an English website. We use a program called Zen-Cart for the English site, and we just copied over the pertinent "info" pages and translated them into Chinese instead of duplicating our entire product database.

 

The Chinese pages are in UTF-8. The English pages are in ISO-8859-1.

 

The problem is that whenever someone enters Chinese into a form, it is processed by the Zen-Cart scripts, so the encoding is swapped and the characters get mangled. The easy solution is to convert the English pages to UTF-8 (which I want to do!).

 

The problem is that &nbsp; and other characters like the ®, © and ™ symbols then start showing up funny on the now-UTF-8-encoded English pages.

 

Is there any way to do a SELECT to find these weird entries and fix them BEFORE we change our pages to Unicode, so that my boss doesn't freak out that half our pages are messed up?

 

I know Unicode will fix all our problems; I just need to figure out how to ensure our database is COMPLETELY Unicode compliant (the data was entered as English/Latin/ISO-whatever encoding).

 

So after all this rambling, I need to know:

1) Is there a way to find the entries in the database that are non-compliant with UTF-8 so we can fix them?

2) Is there a way to convert the database and its contents (not just the collation and charset) to UTF-8? I've tried

UPDATE $table SET $column=CONVERT(CONVERT(CONVERT($column USING latin1) USING binary) USING utf8)

and

ALTER TABLE $table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

to no effect.
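
A note on the second statement: in MySQL, `ALTER TABLE ... DEFAULT CHARACTER SET` only changes the default applied to columns added later, so existing columns and their stored bytes stay as they were. `ALTER TABLE ... CONVERT TO CHARACTER SET` is the form that rewrites existing text columns. A minimal sketch, assuming the columns are still declared latin1 and really do hold Latin-1 data; `my_table` is a placeholder name, and this should be tried on a backup copy first:

```sql
-- Placeholder table name; run against a backup copy first.

-- Changes only the table default used for columns added later;
-- existing columns and their data are left untouched.
ALTER TABLE my_table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Converts every character column in the table and transcodes the
-- stored data. The result is only correct if each column's current
-- charset label matches the bytes actually stored in it.
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

MySQL may widen a TEXT column to MEDIUMTEXT during this conversion so that the longer UTF-8 byte sequences still fit. On MySQL 5.5.3 and later, `utf8mb4` covers the full Unicode range, while `utf8` is limited to three bytes per character.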


I'm not sure I understand... you have to specify the encoding of the connection, too.
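
For reference, a short sketch of what specifying the connection encoding looks like on the MySQL side, assuming it is run right after connecting and before any other query:

```sql
-- Show the character sets the server and this connection are using.
SHOW VARIABLES LIKE 'character_set%';

-- Tell the server that this client sends and expects UTF-8. This sets
-- character_set_client, character_set_connection and
-- character_set_results for the current connection only.
SET NAMES utf8;
```

If `character_set_client` or `character_set_results` still reports latin1 afterwards, the statement did not run on the connection the page is actually using.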

 

Here's what we have. We have a database that was entered in Latin encoding and now has a UTF-8 charset and a UTF-8 collation, but the data in the fields is the same (when we changed the charset and collation, the data was not converted).

The database connection is in UTF-8, but some special characters, like non-breaking spaces and other weird HTML entities, show up as garbled text, diamond ?'s or just ?'s once we switch the pages they're displayed on from ISO-8859-1 to UTF-8.

 

I need to know, is there a way to find these entries so they can be fixed, or is there a way to convert the data entirely?
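
One possible approach to the "convert the data" half, sketched under the assumption that the column is now declared utf8 but a given row still holds the original Latin-1 bytes. `my_table`, `my_column` and `id` are placeholder names. The inner CAST drops the charset label, the first CONVERT relabels the raw bytes as latin1, and the outer CONVERT transcodes them to UTF-8:

```sql
-- Preview only: compare what is stored with the reinterpreted value.
SELECT id,
       my_column AS stored,
       CONVERT(CONVERT(CAST(my_column AS BINARY) USING latin1) USING utf8) AS reinterpreted
FROM my_table
WHERE id = 1;  -- placeholder id of a row known to be broken

-- If the preview looks right, the same expression can be written back
-- for that row only. Rows that already hold valid UTF-8 must be left
-- out, or they will end up double-encoded.
UPDATE my_table
SET my_column = CONVERT(CONVERT(CAST(my_column AS BINARY) USING latin1) USING utf8)
WHERE id = 1;
```

This applies the steps in the opposite order from the UPDATE quoted earlier in the thread, which converts to latin1 first and is the usual shape of a fix for double-encoded UTF-8 rather than for unconverted Latin-1.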

 

It isn't a problem if I *leave* the English pages ISO-8859-1 encoded, but that leaves my other, more pertinent email problem outstanding, and the only solution I've found for that is to convert the English pages to UTF-8.


Is the database correct and the output wrong? Or are both wrong?

 

It's in plain HTML in the database. It's when it's read out that it becomes wrong.

Say you have an &nbsp; in there.

 

In the database it says &nbsp; or © or ® or whatever.

 

When it reads out in ISO-8859-1 encoding on an HTML page, it's a big black square with a ? in it.

 

If I change the page to UTF-8, it's just a ?.

 

We need to rectify the problems with the data in the database (&nbsp;, registered symbols, TM symbols and copyright symbols weren't encoded properly when they were inserted... or something) so that I can make our pages uniformly UTF-8 encoded without the gibberish showing up and my boss freaking out.

 

If I go in and edit it using UTF-8 it sorts itself out, but I just need to know if there's a way to single these entries out, or wholly convert all the rows (in a database of about 30k rows) without just stumbling across them as I browse our products.
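
One way to single these rows out, sketched with placeholder names (`my_table`, `my_column`, `id`): list every row whose column contains at least one byte outside 7-bit ASCII. This is only a candidate list, since correctly stored UTF-8 text such as Chinese also matches, but in a mostly English catalogue it should narrow 30k rows down to the ones worth inspecting:

```sql
-- List rows whose column contains at least one byte >= 0x80.
-- HEX() yields two hex digits per byte; '^(..)*' keeps the match
-- aligned on byte boundaries, and [89A-F] matches a byte whose high
-- nibble is 8 or more.
SELECT id, my_column, HEX(my_column) AS bytes
FROM my_table
WHERE HEX(my_column) REGEXP '^(..)*[89A-F]';
```

The `bytes` column makes it possible to tell the two cases apart by eye: a lone `A0`, `A9` or `AE` is a Latin-1 byte, while the same character stored as UTF-8 is preceded by `C2`.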


And you've used "SET NAMES" in your MySQL connection from PHP, and defined everything else correctly w.r.t. encoding?

 

Yes, it just sort of changes what goes on with the improperly entered data.

 

See here on this page, the 2nd bullet point: http://www.testequipmentconnection.com/products/36938

And here on this page, where the page is UTF-8 encoded: http://70.86.88.202/~tec/products/36938

 

This is what has happened to the database. It was created and filled with data completely in English, much of it copied and pasted from other sources. Quite a bit of it contains HTML code.

 

About 6 months ago, they decided it needed to be able to support Chinese characters, which it couldn't do while it was in Latin encoding.

 

We changed the charset and collation on the database, but I don't think it converted the data.

New data goes in as UTF-8 so it looks right. Old data is sort of a tossup with special characters.
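
Whether the change actually reached the columns and the data can be checked directly. A minimal sketch, assuming it is run while connected to the shop's schema; `my_table`, `my_column` and the id are placeholders:

```sql
-- Declared charset and collation of every text column in the current
-- schema. Changing the database or table default does not alter these.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND CHARACTER_SET_NAME IS NOT NULL;

-- Raw bytes of one suspect row (placeholder names and id).
-- A non-breaking space is A0 in MySQL's latin1 and C2A0 in UTF-8;
-- the copyright sign is A9 vs C2A9, the registered sign is AE vs C2AE,
-- and the trademark sign is 99 vs E284A2.
SELECT HEX(my_column) FROM my_table WHERE id = 1;
```

Comparing the declared charset from the first query with the bytes from the second shows whether an old row is Latin-1 data under a latin1 label, Latin-1 data under a utf8 label, or already UTF-8.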

