Collations, Charsets, Encoding and everything in between

kensaggy · October 31, 2007

Hello fello php programmers,

I have a problem with mysql's collations and encoding regarding utf-8 and hebrew texts.

I'm now working on a running site that has a table which contain a char(60) field.

the fields and the table and collation latin1_swedish_ci for some reason - and i say "for some reason" because the site is running utf-8 and the text value is hebrew.

now because of that, when someone enters text that comes close to 60 chars the utf-8 encoding breaks and i get the wierd ? symbol (? in a black dimond box).

i need to change the collations to utf-8 unicode and somehow recode the values... i thought of building a php scripts that:

1. runs over all the tables

2. changes the collations

3. selects the field

4. converts the encoding and updates the row.

Is this the correct way of solving this problem?

All my attempts were a failure :-( (I tried both iconv and mb_convert_encoding). PHP already recognizes the text I select from the table as UTF-8...

What can/should I do to fix this? Any ideas?

Thanks,

Ken.

fenway · October 31, 2007

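First, you need to see what is not utf8: SHOW TABLE STATUS gives each table's default collation, and SHOW FULL COLUMNS FROM a table gives the collation of each field. The truncation occurs because you're dealing with multi-byte characters (each Hebrew letter takes two bytes in UTF-8), so 60 isn't really 60.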
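
To illustrate the byte arithmetic, here is a minimal Python sketch (Python only because it makes the bytes easy to inspect; the sample titles are made up). It assumes the site's UTF-8 bytes reach the latin1 column unconverted, so the column counts every byte as one character.

```python
# Hypothetical sample: a title of 40 Hebrew letters (alef repeated).
title = "\u05d0" * 40

utf8_bytes = title.encode("utf-8")
print(len(title))       # 40 characters
print(len(utf8_bytes))  # 80 bytes: each Hebrew letter is 2 bytes in UTF-8

# Assumption: a latin1 CHAR(60) keeps only the first 60 single-byte "characters".
stored = utf8_bytes[:60]
print(len(stored.decode("utf-8")))  # 30 letters survive

# If the cut lands in the middle of a letter, the dangling byte is invalid
# UTF-8 and is displayed as U+FFFD, the "? in a black diamond" symbol.
mixed = ("a" + "\u05d0" * 40).encode("utf-8")[:60]
shown = mixed.decode("utf-8", errors="replace")
print(shown.endswith("\ufffd"))  # True
```

Under that assumption a char(60) field holds about 30 Hebrew letters, and any title whose cut-off point falls inside a two-byte letter ends with the broken symbol.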

kensaggy · November 1, 2007

I know which fields I want to change (or rather, which fields have text in them; the rest are numbers).

I have two tables I need to deal with: tbl_topics and tbl_posts. In the _topics table only the topic_title field is problematic, with a definition of char(60), which, like you said, isn't really 60...

And _posts is a bit more problematic because it contains one field which is varchar(120) and one text field.

Now, I don't really understand all this charset and encoding business, and I wish I did...

But why is the text breaking up?

I guess my question is: how can I fix the current rows, and how can I change the table to avoid this in future rows?

Thanks for your patience,

Ken.

aschk · November 1, 2007

You already have a problem that you can't reverse. When the text was originally entered, the database kindly converted it all into the default charset and collation you were using at the time (latin1, latin1_swedish_ci), thus wiping all the extra information that doesn't fit in latin1's single byte per character.

What has happened is that the conversion has been looking at characters and NOT at their binary representations. So it has wiped out everything beyond a single byte (which includes all your Hebrew characters), and now there is no way to retrieve that information.

Or at least the above is what I perceive to have happened. I don't think you can reverse this, and as such you will have to have all the information re-entered.
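
Whether the text is really gone depends on which of two things happened, and a short Python sketch can show the difference. Both cases are assumptions about how the data went in, not something confirmed in this thread, and Python's latin-1 codec only stands in for MySQL's latin1, which differs from it for a few byte values.

```python
word = "\u05e9\u05dc\u05d5\u05dd"  # "shalom" in Hebrew, a made-up sample value

# Case 1: the server really converted Hebrew characters to latin1.
# latin1 has no Hebrew letters, so each one becomes "?" and is lost for good.
converted = word.encode("latin-1", errors="replace")
print(converted)  # b'????'

# Case 2: the connection was treated as latin1, so the UTF-8 bytes were
# stored untouched, one latin1 "character" per byte. Nothing is lost here.
stored = word.encode("utf-8").decode("latin-1")
print(len(stored))  # 8 latin1 characters for 4 Hebrew letters

# Reading the same bytes back as UTF-8 recovers the original text.
recovered = stored.encode("latin-1").decode("utf-8")
print(recovered == word)  # True
```

Ken's remark that PHP already recognizes the selected text as UTF-8 would be consistent with case 2, in which only the values cut off at the field length are actually damaged.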

fenway · November 1, 2007

You should convert everything to UTF8, not just some of the fields... Mixing character sets per field is something that MySQL lets you do for good reason, but in the general case, it's not what you want.
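
If the stored bytes are already UTF-8, one way to convert a column is to pass it through a binary type, so that MySQL relabels the bytes instead of converting them. The Python helper below only builds the statements for review. The helper name is hypothetical, the column definition is the one mentioned in this thread, and the whole sketch assumes the bytes in the column really are UTF-8, so it is worth trying on a copy of the table first.

```python
def relabel_statements(table, column, text_type, binary_type):
    """Build the two ALTER TABLE statements for one column (hypothetical helper)."""
    return [
        # Step 1: switch to a binary type so the stored bytes are left untouched.
        f"ALTER TABLE {table} MODIFY {column} {binary_type};",
        # Step 2: switch back to a text type, now labelled as utf8.
        f"ALTER TABLE {table} MODIFY {column} {text_type} "
        f"CHARACTER SET utf8 COLLATE utf8_unicode_ci;",
    ]

# topic_title is the char(60) column on tbl_topics mentioned in this thread.
statements = relabel_statements(
    "tbl_topics", "topic_title", "CHAR(60)", "VARBINARY(60)"
)
for sql in statements:
    print(sql)
```

MODIFY replaces the whole column definition, so any NOT NULL or DEFAULT attributes have to be restated. After the change, CHAR(60) counts 60 characters rather than 60 bytes, and the PHP connection also has to use utf8 (for example by issuing SET NAMES utf8), otherwise the server will convert the text on the way in and out.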
