Character encoding = $pain

soma56 · February 1, 2012

Wow, what a pain. I have some strings that contain some messed-up Latvian characters, and I'm having a heck of a time trying to convert them to anything else. MySQL won't recognize them when I attempt to update a table, so they have to be changed.

This character, for example, looks like a simple small i:

ī

...but it's not. Here is a page that lists all of them: http://www.eki.ee/letter/chardata.cgi?lang=lv+Latvian&script=latin

And for the record, here are all of them:

ē Ē, ŗ Ŗ, ū Ū, ī Ī, ā Ā, š Š, ģ Ģ, ķ Ķ, ļ Ļ, ž Ž, č Č, ņ Ņ

This is what I have tried in terms of converting them to be used with mysql:

Didn't work:

$result = preg_replace('/ī/','i', $result);

Didn't work:

$current_encoding = mb_detect_encoding($result, 'auto');

$result = iconv($current_encoding, 'UTF-8', $result);

Didn't work:

$result = str_replace("ī", "i", $result);

I'm baffled at why I can't get these characters to convert in PHP. It seems pretty direct to me. Can anyone recommend anything?

scootstah · February 1, 2012

What is the charset of the MySQL field?

kicken · February 1, 2012

If you ensure that your database columns are set to a utf8 character set and your PHP script serves a utf8-encoded page, then you shouldn't have too many issues storing and displaying the odd characters. Where you will still have problems is in searching and manipulating strings that contain them, as you'll have to make sure you use multi-byte aware functions.
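A minimal sketch of why the multi-byte aware functions matter, assuming UTF-8 strings and that the mbstring extension is loaded. The word used is just an illustration taken from Latvian; the `\u{...}` escapes need PHP 7 or later.

```php
<?php
// Assumes the mbstring extension is loaded; \u{012B} is the letter "ī".
$word = "t\u{012B}kli"; // "tīkli"

var_dump(strlen($word));             // int(6): counts bytes, "ī" takes 2 bytes in UTF-8
var_dump(mb_strlen($word, 'UTF-8')); // int(5): counts characters

// substr() can cut a character in half; mb_substr() works on whole characters.
var_dump(substr($word, 0, 2) === "t\u{012B}");             // bool(false)
var_dump(mb_substr($word, 0, 2, 'UTF-8') === "t\u{012B}"); // bool(true)

// Case conversion also needs the mb_ variant for non-ASCII letters.
var_dump(mb_strtoupper($word, 'UTF-8') === "T\u{012A}KLI"); // bool(true)
```

The byte-oriented functions are not wrong as such; they just answer a different question (bytes rather than characters), which is what trips up searching and truncating.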

silkfire · February 1, 2012

Latvian text is usually encoded either as UTF-8 or in a Baltic single-byte charset such as ISO-8859-4. Please check the collation of your MySQL fields.

soma56 · February 6, 2012

Thanks for everyone's input. The MySQL collation is latin1_swedish_ci (the default for varchar).

Although it displays in Firefox correctly I can see in the source that the text is as follows:

&#x53;&#x49;&#x41;

The above reads 'SIA'. The gibberish is what gets written to my database. I tried to detect the encoding, and it was reported as ASCII. I attempted to change it to UTF-8, but it is still reported as ASCII:

echo mb_detect_encoding($result);

echo "<br />";

$current_encoding = mb_detect_encoding($result, 'auto');

$result = iconv($current_encoding, 'UTF-8//IGNORE', $result);

echo mb_detect_encoding($result);

echo "<br />";

I don't know anything about encoding and I'm really confused. How do I convert something like &#x53;&#x49;&#x41; to simple plain text that I'm able to upload to my DB?

kicken · February 6, 2012

&#x53;&#x49;&#x41;

Those are HTML entity sequences. To convert them to their actual characters you just take the hex code, convert it to decimal and then get that chr() value. It can be done with a simple preg_replace_callback:

$str = "&#x53;&#x49;&#x41;";
$re = '/&#x([0-9A-F]{2});/i';
// A callback is used here because the /e modifier was removed in PHP 7.
var_dump(preg_replace_callback($re, function ($m) { return chr(hexdec($m[1])); }, $str));

Once you've converted it back to a normal string using a replace such as that, you should be able to use the encoding functions to detect or convert it as necessary.
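The chr() approach only covers single-byte values, and the pattern only matches two hex digits, so a reference such as `&#x12B;` (the Latvian ī) would be skipped. A hedged sketch of the same idea for longer references, assuming UTF-8 output and the mbstring extension; `decode_hex_entities` is a hypothetical helper name, not something from the thread:

```php
<?php
// Hypothetical helper: decode hexadecimal references such as &#x53; or &#x12B;.
// Assumes the mbstring extension is loaded (for mb_chr, PHP 7.2+).
function decode_hex_entities(string $text): string
{
    // preg_replace_callback is used because PHP 7+ has no /e modifier.
    return preg_replace_callback(
        '/&#x([0-9A-F]{1,6});/i',
        function (array $m): string {
            // hexdec() turns "53" into 83; mb_chr() turns 83 into "S".
            $char = mb_chr((int) hexdec($m[1]), 'UTF-8');
            // Leave the reference untouched if the code point is invalid.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

var_dump(decode_hex_entities("&#x53;&#x49;&#x41;")); // string(3) "SIA"
var_dump(decode_hex_entities("&#x12B;") === "\u{012B}"); // bool(true)
```

PHP's built-in html_entity_decode() with a 'UTF-8' charset argument should give the same result for well-formed references, so the callback is mainly useful when you want control over what gets decoded.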

soma56 · February 6, 2012

Great, that seems to have solved half my issue. Some of the characters are still funny, though. I think this has something to do with the website being Latvian (and hence some of the characters):

Here's the text exactly as it is displayed in Firefox:

Publiskie

Here is what the source code looks like for the above (Latvian) word:

P&#000117bli&#000115kie

So it seems that (for whatever reason) it's translating two of the letters (U and S) to the following:

&#000117 &#000115

These seem to be a little longer than the ones in my previous post. If they are not HTML entity sequences, then what are they?

silkfire · February 6, 2012

They are, but in a different format. Just modify the preg script given by kicken.

soma56 · February 7, 2012

I can seem to 'convert it' where displaying it is concerned but I don't know enough about preg_match to convert it within the source:

$decode = "P&#000117bli&#000115kie m&#000111bil&#000111 &#000116elef&#000111&#000110&#000117 &#000116&#000299kli ";

$decode = htmlentities($decode);
echo $decode;
echo "<br />";

$decode = html_entity_decode($decode);
echo $decode;
echo "<br />";
echo htmlspecialchars_decode($decode);

It's still showing up as

P&#000117bli&#000115kie m&#000111bil&#000111 &#000116elef&#000111&#000110&#000117 &#000116&#000299kli

...in the source.

Googling a solution for this issue is next to impossible when you don't know anything about HTML decoding. The solutions (that I have found and tried, anyway) seem to be focused on providing the correct on-page display rather than the source code.

The puzzling part for me is that I've always thought of preg_match as a find/replace sort of function.

Just modify the preg script given by kicken.

If that's the case how would anyone know that

&#000117 is equal to the letter u?

My face is twitching at this point

kicken · February 7, 2012

&#000117 is an entity just like the hex ones I showed above. Instead of using a hex encoded value it is an octal encoded value though.

html_entity_decode should be able to decode them into their char values. You just need to make sure they are in the correct format, meaning they need to have a ; after the number sequence to end the entity.

soma56 · February 7, 2012

&#000117 is an entity just like the hex ones I showed above. Instead of using a hex encoded value it is an octal encoded value though.

html_entity_decode should be able to decode them into their char values. You just need to make sure they are in the correct format, meaning they need to have a ; after the number sequence to end the entity.

Ok, I'm assuming then that hexdec would change to octdec.

No luck:

$decodethis = "&#117";

$re = '/&#x([0-9A-F]{2});/ie';
$decodethis = preg_replace($re, 'chr(octdec("$1"));', $decodethis);

echo $decodethis;

html_entity_decode also does not work:

$decodethis = "&#117";

$decodethis = html_entity_decode($decodethis);

echo $decodethis;

The browser display is fine, however, the source code is still showing the octal encoded value... :confused:

kicken · February 7, 2012

$decodethis = "&#117";

That is not a valid entity. &#117; is, note the ; on the end.
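A quick check of how this plays out, assuming UTF-8 output. One detail worth noting: a numeric reference without the `x` prefix is a decimal code point, so 117 maps straight to "u" and no hexdec() or octdec() step is needed; reading 117 as octal gives 79, which is "O".

```php
<?php
// &#117; is a decimal reference; &#x75; is the same code point written in hex.
$decimal  = html_entity_decode('&#117;', ENT_QUOTES, 'UTF-8');
$hex      = html_entity_decode('&#x75;', ENT_QUOTES, 'UTF-8');
$padded   = html_entity_decode('&#000117;', ENT_QUOTES, 'UTF-8'); // leading zeros
$unclosed = html_entity_decode('&#117', ENT_QUOTES, 'UTF-8');     // no ';'

var_dump($decimal);  // string(1) "u"
var_dump($hex);      // string(1) "u"
var_dump($padded);   // string(1) "u"
var_dump($unclosed); // string(5) "&#117" (left untouched)

// Treating the digits as octal produces a different character.
var_dump(chr(octdec('117'))); // string(1) "O"
```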

The browser display is fine, however, the source code is still showing the octal encoded value...

Browsers are more forgiving of errors.
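If the source markup cannot be fixed, one option is to add the missing semicolons yourself before decoding. This is only a sketch: `decode_loose_entities` is a made-up helper name, it handles decimal references only, and it assumes the text should end up as UTF-8.

```php
<?php
// Hypothetical helper: terminate unclosed decimal references, then decode.
function decode_loose_entities(string $html): string
{
    // "&#" plus digits that are NOT followed by another digit or ';'
    // gets a ';' appended; already well-formed references are left alone.
    $fixed = preg_replace('/&#(\d+)(?![\d;])/', '&#$1;', $html);

    return html_entity_decode($fixed, ENT_QUOTES, 'UTF-8');
}

$decode = "P&#000117bli&#000115kie m&#000111bil&#000111 "
        . "&#000116elef&#000111&#000110&#000117 &#000116&#000299kli";

echo decode_loose_entities($decode), "\n";
// Publiskie mobilo telefonu tīkli
```

After this step the string contains real UTF-8 characters (including ī, code point 299), so it still needs a utf8 column and connection on the MySQL side to be stored intact.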
