soma56 Posted February 1, 2012

Wow, what a pain. I have some strings that contain some f-uped Latvian characters, and I'm having a heck of a time trying to convert them to anything. MySQL won't recognize them when I attempt to update a table, so they must be changed. This character, for example, looks like a simple small i: ī ...but it's not. Here is a page where you'll see all of them: http://www.eki.ee/letter/chardata.cgi?lang=lv+Latvian&script=latin

And for the record, here are all of them: ē Ē, ŗ Ŗ, ū Ū, ī Ī, ā Ā, š Š, ģ Ģ, ķ Ķ, ļ Ļ, ž Ž, č Č, ņ Ņ

This is what I have tried in terms of converting them to be used with MySQL.

Didn't work:
    $result = preg_replace('/ī/','i', $result);

Didn't work:
    $current_encoding = mb_detect_encoding($result, 'auto');
    $result = iconv($current_encoding, 'UTF-8', $result);

Didn't work:
    $result = str_replace("ī", "i", $result);

I'm baffled at why I can't get these characters to convert in PHP. It seems pretty direct to me. Can anyone recommend anything?
scootstah Posted February 1, 2012

What is the charset of the MySQL field?
kicken Posted February 1, 2012

If you ensure that your database columns are set to a utf8 character set and your PHP script serves a UTF-8 encoded page, then you shouldn't have too many issues storing and displaying the odd characters. The areas where you will still have problems are searching and manipulating the strings containing them, as you'll have to ensure that you use multi-byte aware functions.
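To illustrate the point about multi-byte aware functions, here is a minimal sketch. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the sample word is taken from later in this thread, and the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is available and this file is saved as UTF-8.
$word = "tīkli"; // "ī" occupies two bytes in UTF-8

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($word);
$byteCut    = substr($word, 0, 2);             // splits "ī" in half

// Multi-byte aware functions work on whole characters.
$charLength = mb_strlen($word, 'UTF-8');
$charCut    = mb_substr($word, 0, 2, 'UTF-8');
$upper      = mb_strtoupper($word, 'UTF-8');

echo $byteLength, "\n"; // 6
echo $charLength, "\n"; // 5
echo $charCut, "\n";    // tī
echo $upper, "\n";      // TĪKLI

// The byte-wise cut is no longer valid UTF-8.
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
```

Passing the encoding explicitly to each mb_* call avoids depending on whatever mb_internal_encoding() happens to be set to on a given server.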
silkfire Posted February 1, 2012

The Latvian alphabet uses either UTF-8 or ISO-8859-4 as the charset. Please check the collation of your MySQL fields.
soma56 Posted February 6, 2012

Thanks for everyone's input. The MySQL collation is latin1_swedish_ci (the default for varchar). Although it displays in Firefox correctly, I can see in the source that the text is as follows:

    &#x53;&#x49;&#x41;

The above reads 'SIA'. The gibberish is how it updates into my database. I tried to detect the encoding, which stated it was ASCII. I attempted to change it to UTF-8, but it still says that it's ASCII:

    echo mb_detect_encoding($result);
    echo "<br />";
    $current_encoding = mb_detect_encoding($result, 'auto');
    $result = iconv($current_encoding, 'UTF-8//IGNORE', $result);
    echo mb_detect_encoding($result);
    echo "<br />";

I don't know anything about encoding and I'm really confused. How do I convert something like &#x53;&#x49;&#x41; to simple plain text that I'm able to upload to my DB?
kicken Posted February 6, 2012

    &#x53;&#x49;&#x41;

Those are HTML entity sequences. To convert them to their actual characters, you just take the hex code, convert it to decimal, and then get that chr() value. It can be done with a simple preg_replace:

    $str="&#x53;&#x49;&#x41;";
    $re = '/&#x([0-9A-F]{2});/ie';
    var_dump(preg_replace($re, 'chr(hexdec("$1"));', $str));

Once you've converted it back to a normal string using a replace such as that, you should be able to use the encoding functions to detect or convert it as necessary.
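The /e modifier used in that pattern was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions the same replacement has to go through preg_replace_callback. The following is a sketch of that idea, not the poster's code: the function name decodeHexByteEntities is hypothetical, and the sample input assumes the page source held the two-digit hex references for S, I and A.

```php
<?php
// Hypothetical helper name; the thread itself only shows the /e version.
function decodeHexByteEntities(string $text): string
{
    // Match &#xHH; with exactly two hex digits, as in the original pattern.
    return preg_replace_callback(
        '/&#x([0-9A-F]{2});/i',
        function (array $m): string {
            // hexdec() turns "53" into 83, chr() turns 83 into "S".
            return chr(hexdec($m[1]));
        },
        $text
    );
}

// Assumed input: hex references for "SIA".
$decoded = decodeHexByteEntities("&#x53;&#x49;&#x41;");
var_dump($decoded); // string(3) "SIA"

// chr() only produces single bytes, so it cannot build characters such as
// "ī" (U+012B). html_entity_decode() with an explicit UTF-8 charset can.
$latvian = html_entity_decode("&#x12B;", ENT_QUOTES, 'UTF-8');
var_dump($latvian === "\xC4\xAB"); // bool(true)
```

Because the references consist only of ASCII characters, a string that still contains them will be reported as ASCII by mb_detect_encoding, which matches what was observed earlier in the thread.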
soma56 Posted February 6, 2012

Great, that seems to have solved half my issue. It seems some of the characters are still funny. I think this has something to do with the website being Latvian (and hence some of the characters).

Here's the text exactly as it is displayed in Firefox: Publiskie

Here is what the source code looks like for the above (Latvian) word: Publiskie

So it seems that (for whatever reason) it's translating two of the letters (U and S) to the following: u s

These seem to be a little longer than the ones in my previous post. If they are not HTML entity sequences, then what are they?
silkfire Posted February 6, 2012

They are, but in a different format. Just modify the preg script given by kicken.
soma56 Posted February 7, 2012

I can seem to 'convert it' where displaying it is concerned, but I don't know enough about preg_match to convert it within the source:

    $decode = "Publiskie mobilo telefonu tīkli ";
    $decode = htmlentities($decode );
    echo $decode;
    echo "<br />";
    $decode = html_entity_decode($decode );
    echo $decode;
    echo "<br />";
    echo htmlspecialchars_decode($decode);

It's still showing up as Publiskie mobilo telefonu tīkli ...in the source. Googling a solution for this issue is next to impossible when you don't know anything about HTML decoding. The solutions (that I have found and tried, anyway) seem to be focused on providing the correct on-page display rather than the source code. The puzzling part for me is that I've always thought of preg_match as a find/replace sort of function.

    "Just modify the preg script given by kicken."

If that's the case, how would anyone know that u is equal to the letter u? My face is twitching at this point.
kicken Posted February 7, 2012

u is an entity just like the hex ones I showed above. Instead of using a hex encoded value it is an octal encoded value though. html_entity_decode should be able to decode them into their char values. You just need to make sure they are in the correct format, meaning they need to have a ; after the number sequence to end the entity.
soma56 Posted February 7, 2012

    "u is an entity just like the hex ones I showed above. Instead of using a hex encoded value it is an octal encoded value though. html_entity_decode should be able to decode them into their char values. You just need to make sure they are in the correct format, meaning they need to have a ; after the number sequence to end the entity."

Ok, I'm assuming then that hexdec would change to octdec. No luck:

    $decodethis = "u";
    $re = '/&#x([0-9A-F]{2});/ie';
    $decodethis = preg_replace($re, 'chr(octdec("$1"));', $decodethis);
    echo $decodethis;

html_entity_decode also does not work:

    $decodethis = "u";
    $decodethis = html_entity_decode($decodethis);
    echo $decodethis;

The browser display is fine, however, the source code is still showing the octal encoded value...
kicken Posted February 7, 2012

    $decodethis = "u";

That is not a valid entity. u is, note the ; on the end.

    "The browser display is fine, however, the source code is still showing the octal encoded value..."

Browsers are more forgiving of errors.
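The advice about the missing ; can be sketched as a two-step decode: terminate the bare references first, then hand the string to html_entity_decode. The exact references from the page source did not survive in this thread, so the sample inputs below are an assumption (decimal references with no terminating semicolon, such as &#117 for "u"), and the function name decodeLooseNumericEntities is hypothetical. Note that HTML numeric character references only come in decimal (&#117;) and hexadecimal (&#x75;) forms, which is why swapping hexdec for octdec does not help: octdec("117") is 79, the letter "O".

```php
<?php
// Hypothetical helper: add the missing ";" to bare decimal references,
// then decode everything as UTF-8.
function decodeLooseNumericEntities(string $text): string
{
    // "&#" + digits that are NOT followed by another digit or by ";".
    // Only decimal references are repaired here; an unterminated hex
    // reference is ambiguous when the next letter is a-f.
    $fixed = preg_replace('/&#([0-9]+)(?![0-9;])/', '&#$1;', $text);

    return html_entity_decode($fixed, ENT_QUOTES, 'UTF-8');
}

// Assumed inputs, reconstructed for illustration only.
$word1 = decodeLooseNumericEntities("P&#117bli&#115kie");
$word2 = decodeLooseNumericEntities("t&#299kli");

echo $word1, "\n"; // Publiskie
echo $word2, "\n"; // tīkli

// References that already end in ";" are left alone by the repair step.
echo decodeLooseNumericEntities("&#117;"), "\n"; // u
```

The resulting string is UTF-8, so storing it still depends on the column and connection character set discussed at the top of the thread; a latin1 column cannot hold "ī".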