
Character encoding = $pain


soma56


Wow, what a pain. I have some strings that contain some messed-up Latvian characters. I'm having a heck of a time trying to convert them to anything usable. MySQL won't recognize them when I attempt to update a table, so they have to be changed.

 

This character, for example, looks like a simple small i:

 

ī

 

...but it's not. Here is a page where you can see all of them (http://www.eki.ee/letter/chardata.cgi?lang=lv+Latvian&script=latin).

 

And for the record, here are all of them.

 

ē Ē, ŗ Ŗ, ū Ū, ī Ī, ā Ā, š Š, ģ Ģ, ķ Ķ, ļ Ļ, ž Ž, č Č, ņ Ņ

 

This is what I have tried in terms of converting them to be used with mysql:

 

Didn't work:

$result = preg_replace('/ī/','i', $result);

 

Didn't work:

$current_encoding = mb_detect_encoding($result, 'auto');
$result = iconv($current_encoding, 'UTF-8', $result);

 

Didn't work:

$result = str_replace("ī", "i", $result);

 

I'm baffled at why I can't get these characters to convert in PHP. It seems pretty direct to me. Can anyone recommend anything?

 


If you ensure that your database columns are set to a utf8 character set and your PHP script serves a utf8-encoded page, then you shouldn't have too many issues storing and displaying the odd characters. The areas where you will still have problems are searching and manipulating the strings containing them, as you'll have to ensure that you use multi-byte aware functions.
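
To illustrate the point about multi-byte aware functions, here is a minimal sketch. It assumes the mbstring extension is loaded and that the text is UTF-8; the sample word and variable names are only for illustration, and the database side (column and connection character set) is not shown.

```php
<?php
// The \x escapes spell out the UTF-8 bytes of "ī" (U+012B), so the example
// does not depend on how this file itself is saved.
$word = "t\xC4\xABkli"; // "tīkli"

// Byte-oriented functions count bytes, not characters.
$byteLength = strlen($word);                     // 6: "ī" takes two bytes

// Multi-byte aware functions count characters.
$charLength = mb_strlen($word, 'UTF-8');         // 5

// substr() can cut a character in half; mb_substr() works on characters.
$brokenPrefix = substr($word, 0, 2);             // "t" plus half of "ī"
$cleanPrefix  = mb_substr($word, 0, 2, 'UTF-8'); // "tī"

// Case conversion of non-ASCII letters also needs the mb_ variant.
$upper = mb_strtoupper($word, 'UTF-8');          // "TĪKLI"

echo $byteLength, " bytes, ", $charLength, " characters\n";
echo $cleanPrefix, " / ", $upper, "\n";
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8')); // bool(false)
```

The byte-oriented functions are still fine for pure ASCII data; the differences only show up once a character needs more than one byte.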


Thanks for everyone's input. The MySQL collation is latin1_swedish_ci (the default for varchar).

 

Although it displays in Firefox correctly I can see in the source that the text is as follows:

 

&#x53;&#x49;&#x41;

 

The above reads 'SIA'. The gibberish is how it gets written to my database. I tried to detect the encoding, which was reported as ASCII. I attempted to convert it to UTF-8, but it is still reported as ASCII:

 

 

echo mb_detect_encoding($result);

echo "<br />";

 

$current_encoding = mb_detect_encoding($result, 'auto');
$result = iconv($current_encoding, 'UTF-8//IGNORE', $result);

 

echo mb_detect_encoding($result);

echo "<br />";

 

I don't know anything about encoding and I'm really confused. How do I convert something like &#x53;&#x49;&#x41; to simple plain text that I'm able to upload to my DB?

 


&#x53;&#x49;&#x41;

 

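Those are HTML entity sequences (numeric character references). To convert them to their actual characters, you just take the hex code, convert it to decimal and then get the chr() value for it. It can be done with a simple preg_replace_callback (the /e modifier for preg_replace was removed in PHP 7):

$str = "&#x53;&#x49;&#x41;";
$re = '/&#x([0-9A-F]{2});/i';
// Replace each two-digit hex entity with the single byte it encodes.
var_dump(preg_replace_callback($re, function ($m) {
    return chr(hexdec($m[1]));
}, $str));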

 

Once you've converted it back to a normal string using a replace such as that, you should be able to use the encoding functions to detect or convert it as necessary.
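
As a possible alternative to a hand-written regex, the built-in html_entity_decode() can decode numeric references directly. The sketch below assumes UTF-8 is the desired output and that mbstring is available for the detection calls; the second sample string is made up for illustration. It also shows why the detection reported ASCII: an entity string contains nothing but ASCII bytes until it has been decoded. Note that chr() returns a single byte, so it only suits values up to 0xFF, whereas html_entity_decode() can produce multi-byte UTF-8.

```php
<?php
$str = "&#x53;&#x49;&#x41;";

// Only ASCII bytes are present, so strict detection reports ASCII.
$before = mb_detect_encoding($str, ['ASCII', 'UTF-8'], true);

// Decode the numeric references into UTF-8 text.
$decoded = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// A reference above U+007F, such as &#x12B; (ī), becomes multi-byte UTF-8.
$latvian = html_entity_decode("t&#x12B;kli", ENT_QUOTES | ENT_HTML5, 'UTF-8');
$after = mb_detect_encoding($latvian, ['ASCII', 'UTF-8'], true);

echo $before, "\n";  // ASCII
echo $decoded, "\n"; // SIA
echo $after, "\n";   // UTF-8
```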

 


Great, that seems to have solved half my issue. It seems some of the characters are still funny. I think this has something to do with the website being Latvian (and hence some of the characters):

 

Here's the text exactly as it is displayed in firefox:

 

Publiskie

 

Here is what the source code looks like for the above (Latvian) word:

 

P&#000117bli&#000115kie

 

So it seems that (for whatever reason) it's translating two of the letters (U and S) to the following:

 

&#000117  &#000115

 

These seem to be a little longer than the ones in my previous post. If they are not HTML entity sequences, then what are they?


I can seem to 'convert it' as far as displaying it is concerned, but I don't know enough about preg_match to convert it within the source:

 

$decode = "P&#000117bli&#000115kie m&#000111bil&#000111 &#000116elef&#000111&#000110&#000117 &#000116&#000299kli ";

$decode  = htmlentities($decode );
echo $decode; 
echo "<br />";

$decode  = html_entity_decode($decode );
echo $decode;
echo "<br />";
echo htmlspecialchars_decode($decode);

 

It's still showing up as

 

P&#000117bli&#000115kie m&#000111bil&#000111 &#000116elef&#000111&#000110&#000117 &#000116&#000299kli

 

...in the source.

 

Googling a solution for this issue is next to impossible when you don't know anything about HTML decoding. The solutions (that I have found and tried anyway) seem to be focused on providing the correct on-page display rather than the source code.

 

The puzzling part for me is that I've always thought of preg_match as find/replace sort of function.

 

Just modify the preg script given by kicken.

 

If that's the case how would anyone know that

 

&#000117 is equal to the letter u?

 

My face is twitching at this point  ;)


&#000117; is an entity just like the hex ones I showed above. Instead of a hex-encoded value it uses a decimal value, though; the leading zeros are only padding.

html_entity_decode should be able to decode them into their char values. You just need to make sure they are in the correct format, meaning they need to have a ; after the number sequence to end the entity.

 

Ok, I'm assuming then that hexdec would change to octdec.

 

No luck:

 

$decodethis = "&#117";

$re = '/&#x([0-9A-F]{2});/ie';
$decodethis = preg_replace($re, 'chr(octdec("$1"));', $decodethis);

echo $decodethis; 

 

html_entity_decode also does not work:

$decodethis = "&#117";

$decodethis = html_entity_decode($decodethis);

echo $decodethis; 

 

The browser display is fine; however, the source code is still showing the encoded value... :confused:
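
For what it's worth, the numbers in these references are plain decimal code points (117 is 'u', 299 is 'ī'), so no hexdec() or octdec() step is needed. The browser is lenient about the missing ';', which is why the page looks fine, while html_entity_decode() is stricter and leaves such references alone. A minimal sketch, assuming UTF-8 output is wanted and that a reference is never directly followed by a literal digit in the text (the regex could not tell the two apart):

```php
<?php
$raw = "P&#000117bli&#000115kie m&#000111bil&#000111 "
     . "&#000116elef&#000111&#000110&#000117 &#000116&#000299kli";

// Step 1: make sure every decimal reference ends with ';'.
// An existing ';' is consumed by the pattern, so it is never doubled.
$fixed = preg_replace('/&#(\d+);?/', '&#${1};', $raw);

// Step 2: decode the now well-formed references into UTF-8 text.
$text = html_entity_decode($fixed, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n"; // Publiskie mobilo telefonu tīkli
```

The decoded string still contains 'ī' as a real UTF-8 character, so the table or column it is stored in would need a character set that can represent it.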
