[SOLVED] Converting characters


watsmyname

well,

 

Let's say the text in my language is "वनमान्छे जोडीलाई सन्तान" [without quotes]. This text is saved in the database in a format like "&#2319;&#2325;&#2376; &#2327;&#2366;". I have a search function on my website where the user types a search keyword in my language. The problem is that it doesn't work: my user types, say, "वनमान्छे" in the search box, but in the database it is saved in a different character format. So how do I convert text in my language into the format saved in the database, so that I can match the converted string against the database?

 

Thanks


You'd be better off storing the data in your database as UTF-8, or as the appropriate character set for your language.

 

The following code will convert an individual character such as न to its HTML encoded equivalent (where possible)

function CHARACTER($character) {
$character	= self::flattenSingleValue($character);

if (function_exists('mb_convert_encoding')) {
	return mb_convert_encoding('&#'.intval($character).';', 'UTF-8', 'HTML-ENTITIES');
} else {
	return chr(intval($character));
}
}

You could loop through your search string, performing this conversion against each character, to build up a new string that (hopefully) should match what's stored on your database
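A minimal sketch of that per-character loop, going from the typed UTF-8 text to decimal entities. It assumes the input is valid UTF-8 and deliberately avoids mbstring; the function names `utf8_codepoint` and `to_numeric_entities` are made up for illustration and are not part of the class the snippet above came from.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 encoded character.
function utf8_codepoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0]; // plain ASCII
    }
    // Strip the length-marker bits from the lead byte, then append
    // the low 6 bits of every continuation byte.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: replace every non-ASCII character with &#NNNN;.
function to_numeric_entities(string $text): string
{
    // The /u flag makes preg_split work on whole UTF-8 characters
    // (assumption: $text is valid UTF-8, otherwise preg_split fails).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $out = '';
    foreach ($chars as $char) {
        $out .= strlen($char) === 1 ? $char : '&#' . utf8_codepoint($char) . ';';
    }
    return $out;
}

echo to_numeric_entities('एकै गा'), "\n"; // &#2319;&#2325;&#2376; &#2327;&#2366;
```

The converted keyword can then be compared against the entity text stored in the database. ASCII characters such as the space are left as they are.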


Apologies, Rumplestiltskin, I simply cut and pasted it from a class.

The line in question simply traps in case an array is passed into the function rather than a character, and can be completely removed.

 

function CHARACTER($character) {
if (function_exists('mb_convert_encoding')) {
	return mb_convert_encoding('&#'.intval($character).';', 'UTF-8', 'HTML-ENTITIES');
} else {
	return chr(intval($character));
}
}

 

Your best solution would still have been to modify your database charset


Thanks for the quick reply. That wouldn't return the character in the format &#2366;; it returns some weird binary-looking character. I have seen the SMF forum's database: they too save our language's characters in the format &#2366;, but when I search the forum with some words in my language it gives the correct result. I don't know how they are converting the user's Unicode input into this format and searching for it in the database. Got any idea?


Looking at it, the board seems to be messing with some of the code:

Anything that looks like &amp;# should be &#

function CHARACTER($character) {
if (function_exists('mb_convert_encoding')) {
	return mb_convert_encoding('&#'.intval($character).';', 'UTF-8', 'HTML-ENTITIES');
} else {
	return chr(intval($character));
}
}

The idea is to take the character, convert it to its HTML entity number

If that doesn't work, try:

function CHARACTER($character) {
if (function_exists('mb_convert_encoding')) {
	return '&#'.intval($character).';';
} else {
	return intval($character);
}
}

 


Well that does tell us that the mbstring extension for PHP isn't enabled. Without that, you're going to have a lot of difficulty manipulating non-ANSI characters in any way.

 

If you're going to be working with character sets like वनमान्छे, (and can't use mbstring or iconv) then you need to be using UTF-8 (or the actual character set in question) consistently between your page, the database and all communications between. You are going to have to use UTF-8 for your database rather than converting to html entities.
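As a rough illustration of why storing plain UTF-8 makes the search trivial, here is a sketch that uses PDO with an in-memory SQLite database as a stand-in. The thread does not say which database or driver the OP uses, so the driver, the table name `articles` and the function `search_titles` are all assumptions. With MySQL the same idea would mean a utf8mb4 column plus `charset=utf8mb4` in the PDO DSN.

```php
<?php
// Assumption: the pdo_sqlite driver is available; SQLite stores text as UTF-8.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)');

// Store the raw UTF-8 text, not HTML entities.
$insert = $pdo->prepare('INSERT INTO articles (title) VALUES (?)');
$insert->execute(['वनमान्छे जोडीलाई सन्तान']);

// Hypothetical search helper: the keyword is matched exactly as typed.
// Note: % and _ inside the keyword are not escaped in this sketch.
function search_titles(PDO $pdo, string $keyword): array
{
    $stmt = $pdo->prepare('SELECT title FROM articles WHERE title LIKE ?');
    $stmt->execute(['%' . $keyword . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$matches = search_titles($pdo, 'वनमान्छे');
print_r($matches);
```

No conversion step is needed at all here, as long as the page that submits the form is also served as UTF-8.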


What? The OP is right, intval() returns 0 on strings/characters that can't be directly casted to an integer. Has got nothing to do with the multibyte functions (correct me if I'm wrong).

 

Simply try to use htmlentities() (code tags f**k up the characters):

 

<?php

$str = 'वनमान्छे जोडीलाई सन्तान';

$entities = htmlentities($str, ENT_QUOTES, 'UTF-8');

?>

 

Remember to use the appropriate second parameter, depending on how quotes are stored in your database.

 

But I agree with Mark in that you should store the data as UTF-8 characters instead. Takes up much less space.


What? The OP is right, intval() returns 0 on strings/characters that can't be directly casted to an integer. Has got nothing to do with the multibyte functions (correct me if I'm wrong).

Correct, the use of intval is a) not correct, b) not capable of handling multibyte... but that's also how the routine was trying to provide a fallback if mb_convert_encoding wasn't available... nothing to do with multibyte strings.

 

There must have been some reason why I used intval() at the time, but no idea what it was now.


Having mulled it over, I've figured out my error.

 

It was the wrong function for the OP's problem (the real solution is still to change his database): it was the reverse function, the multibyte equivalent of chr() rather than of ord(). Feed it a numeric value and it returns the UTF-8 character with that value.
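To make the two directions concrete, here is a small sketch. It assumes the mbstring extension is loaded and uses `mb_chr()` and `mb_ord()` (available since PHP 7.2) rather than the `mb_convert_encoding()` call from the earlier posts.

```php
<?php
// Requires the mbstring extension (mb_ord/mb_chr exist since PHP 7.2).
$char = 'न';

// intval() parses digits; a Devanagari letter has none, so the result is 0.
var_dump(intval($char));                    // int(0)

// chr()-like direction: number in, UTF-8 character out.
var_dump(mb_chr(2344, 'UTF-8') === $char);  // bool(true)

// ord()-like direction: UTF-8 character in, number out.
// This is the direction the search keyword actually needs.
$entity = '&#' . mb_ord($char, 'UTF-8') . ';';
var_dump($entity);                          // string(7) "&#2344;"
```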

 


What? The OP is right, intval() returns 0 on strings/characters that can't be directly casted to an integer. Has got nothing to do with the multibyte functions (correct me if I'm wrong).

 

Simply try to use htmlentities() (code tags f**k up the characters):

 

<?php

$str = 'वनमान्छे जोडीलाई सन्तान';

$entities = htmlentities($str, ENT_QUOTES, 'UTF-8');

?>

 

Remember to use the appropriate second parameter, depending on how quotes are stored in your database.

 

But I agree with Mark in that you should store the data as UTF-8 characters instead. Takes up much less space.

This isn't working either.

I just want to convert the given string to "&#2319;&#2325;&#2376; &#2327;&#2366;" so that I can search for it in the database. By the way, what is this format "&#2319;&#2325;&#2376; &#2327;&#2366;" called?
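For reference, strings like "&#2319;" are usually called numeric character references (decimal HTML entities). htmlentities() only substitutes characters that have a named entity, and Devanagari letters have none, so it passes them through untouched. A minimal sketch of one way to get the decimal form, assuming the mbstring extension is available; the wrapper name `utf8_to_decimal_references` is made up for illustration:

```php
<?php
// Requires mbstring.
// Map format: [first code point, last code point, offset, mask].
function utf8_to_decimal_references(string $text): string
{
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF]; // everything outside ASCII
    return mb_encode_numericentity($text, $map, 'UTF-8');
}

$keyword = 'एकै गा';

// htmlentities() leaves the Devanagari text unchanged.
var_dump(htmlentities($keyword, ENT_QUOTES, 'UTF-8') === $keyword); // bool(true)

echo utf8_to_decimal_references($keyword), "\n"; // &#2319;&#2325;&#2376; &#2327;&#2366;
```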


Thanks guys, I found a function that works exactly as I wanted:

 

<?php
// ereg() and the preg /e modifier were removed in PHP 7, so this version
// uses preg_match() and preg_replace_callback() instead.
function charset_decode_utf_8($string) {
    // Only do the slow conversion if there are 8-bit characters
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }

    // Decode three-byte UTF-8 sequences into decimal numeric entities
    $string = preg_replace_callback(
        '/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string
    );

    // Decode two-byte UTF-8 sequences
    $string = preg_replace_callback(
        '/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string
    );

    return $string;
}
?>
