Russian chars part II: Moving to UTF8


6 replies to this topic

#1 ctiberg

ctiberg
  • Members
  • Pip
  • Newbie
  • 7 posts

Posted 11 October 2006 - 02:31 PM

Hello!

I posted earlier about having problems with Russian characters. I have now decided to move to UTF-8, but can't seem to get it to work. My test system contains three scripts: an editor (a form), a storer, and a viewer.

I seem to be able to get the text into the database as UTF-8, but then I can't show it on screen; all I get is garbage. So I'm hoping for some help here, preferably hands-on :)

The editor is just a form, with the following "specials":

<?php header("Content-type: text/html; charset=utf-8"); ?>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
<form name="inputfrm" method="POST" action="lagra_txt.php" accept-charset="utf-8">

Despite specifying utf-8 in accept-charset, the data I receive seems to be windows-1252. Why?

On to the storer. Here I've got this:

// Connect to the DB using mysql_connect and mysql_select_db

  $sql = "SET NAMES 'utf8'";
  mysql_query($sql);

  // The lines below were copied from an article on mysql.com - they check if I got UTF-8
  $test  = $_POST["charset_check"];
  if (bin2hex($test) == "c3a4e284a2c2ae")
    $OK = true;
  elseif (bin2hex($test) == "e499ae")
    $OK = false;
  else
    die("Sorry, I didn't understand the character set of the data you sent!");

  foreach ($_POST as $key => $val)
    {
      if ($key == "charset_check") continue;
      if ($val != "")
        {
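          if (!$OK) $val = iconv("windows-1252", "utf-8", $val);
          // Escape both values so a quote in the text cannot break the query
          $sql = "UPDATE luka_texter SET `Text`='".mysql_real_escape_string($val)."' WHERE ID='".mysql_real_escape_string($key)."' AND Sprak='ru'";
          mysql_query($sql);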
        }
    }
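
The two hex strings in the check above are the same three characters, ä™®, encoded two ways: `c3a4e284a2c2ae` is their UTF-8 form and `e499ae` is their windows-1252 form. Presumably the form carries them in a hidden `charset_check` field. A minimal standalone sketch of that check follows; the function name `detect_form_charset` is hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper (not from the original post): classify the probe
// field by the bytes the browser actually sent.
function detect_form_charset(string $probe): string
{
    $hex = bin2hex($probe);
    if ($hex === 'c3a4e284a2c2ae') {
        return 'utf-8';        // "ä™®" encoded as UTF-8
    }
    if ($hex === 'e499ae') {
        return 'windows-1252'; // "ä™®" encoded as windows-1252
    }
    return 'unknown';
}

// The probe written as explicit bytes, so this file's own encoding does not matter.
$utf8Probe   = "\xC3\xA4\xE2\x84\xA2\xC2\xAE";
$cp1252Probe = "\xE4\x99\xAE";

echo detect_form_charset($utf8Probe), "\n";   // utf-8
echo detect_form_charset($cp1252Probe), "\n"; // windows-1252
```

Returning a label instead of a boolean keeps the "unknown" case explicit, which is the branch where the original script calls `die()`.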

As I said, this seems to get the data into the DB all right, and I think it is stored as UTF-8 in there (at least it looks like junk, which is what UTF-8 looks like to me).

The viewer is very simple, like this:

<?php header("Content-type: text/html; charset=utf-8"); ?>
<html>
<head>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
<title>DB-edit</title>
</head>
<body>
<?php
// Connect to the DB using mysql_connect and mysql_select_db

  $sql = "SET NAMES 'utf8'";
  mysql_query($sql);

  $sql = "SELECT ID, Text FROM luka_texter WHERE Sprak='ru'";
  $res = mysql_query($sql) or die(mysql_error());
  while ($rad = mysql_fetch_assoc($res))
    print $rad["ID"]." ".iconv("utf-8", "windows-1252", $rad["Text"])."<br>";
  mysql_free_result($res);
?>
</body></html>

The trouble is that some texts are truncated, some characters are replaced by question marks, and so on. Can anyone point out what I'm doing wrong?
Best regards,
Christian Tiberg
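
One possible factor in the truncation and question marks described above is the viewer's `iconv("utf-8", "windows-1252", ...)` call: the page is declared as UTF-8, yet the text is converted to windows-1252, a character set that contains no Cyrillic letters. The sketch below is not from the thread. It assumes the mbstring extension is loaded, and `survives_roundtrip` is a hypothetical helper name. It checks whether a UTF-8 string survives a trip through a legacy encoding.

```php
<?php
// Hypothetical helper: convert UTF-8 text to a legacy encoding and back,
// then report whether anything was lost on the way.
// Assumes the mbstring extension is loaded.
function survives_roundtrip(string $utf8Text, string $legacyEncoding): bool
{
    $legacy = mb_convert_encoding($utf8Text, $legacyEncoding, 'UTF-8');
    $back   = mb_convert_encoding($legacy, 'UTF-8', $legacyEncoding);
    return $back === $utf8Text;
}

// Sample strings written as code point escapes, so the file encoding does not matter.
$swedish = "\u{00E4}\u{00F6}";                                  // two Latin letters with diaeresis
$russian = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}";  // a six-letter Cyrillic word

var_dump(survives_roundtrip($swedish, 'Windows-1252'));
var_dump(survives_roundtrip($russian, 'Windows-1252'));
var_dump(survives_roundtrip($russian, 'Windows-1251'));
```

If the Cyrillic string fails the windows-1252 round trip while the others pass, the conversion step rather than the database may be what is losing the text, and printing `$rad["Text"]` without the `iconv()` call would be worth trying.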

#2 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 11 October 2006 - 02:38 PM

What is the database using?

SHOW VARIABLES LIKE 'character\_set%'

Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#3 ctiberg

Posted 11 October 2006 - 02:53 PM

Everything is set to latin1 when I run the above in MyDB Studio, except for character_set_system, which is set to utf8.

The Text column has had its character set changed to utf8, though, using:

DROP TABLE IF EXISTS `luka_texter`;
CREATE TABLE `luka_texter` (
  `ID` varchar(50) NOT NULL default '',
  `Sprak` char(2) NOT NULL default '',
  `TEXT` text CHARACTER SET utf8,
  PRIMARY KEY  (`ID`,`Sprak`)
) ENGINE=MyISAM DEFAULT CHARACTER SET=latin1;

#4 effigy

Posted 11 October 2006 - 04:03 PM

What is your input?

#5 ctiberg

Posted 11 October 2006 - 04:41 PM

The input comes from a form (I gave the form element syntax above) containing some 40-50 text strings that have been translated from English into Russian. I copy them from an Excel sheet one at a time and paste each one into its form field. Each form field has a name that is then used as the ID in the MySQL table.

This is of course a very simple example, but I need this to work before I go on to the rest of the site.

#6 effigy

Posted 11 October 2006 - 06:46 PM

This is working for me. Note that I changed the table a little.

<?php header("Content-type: text/html; charset=utf-8"); ?>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">
<pre>
<?php
	if ($_POST) {
		### Show what we received and proceed with database interaction.
		print_r($_POST);
		### Connect, select, drop/create if needed.
		mysql_connect('localhost', 'user', 'password') or die;
		mysql_select_db('test') or die (mysql_error());
		$table_check = mysql_query('DESC `luka_texter`');
		if (mysql_error()) {
			mysql_query('
				CREATE TABLE `luka_texter` (
				`ID` INT NOT NULL AUTO_INCREMENT,
				`Sprak` char(2) NOT NULL,
				`TEXT` text CHARACTER SET utf8,
				PRIMARY KEY  (`ID`,`Sprak`)
				) ENGINE=MyISAM DEFAULT CHARACTER SET=latin1;
			') or die (mysql_error());
		}
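		### Insert. Escape the submitted text so a quote in it cannot break the query.
		$text = mysql_real_escape_string($_POST['utf8_textarea']);
		mysql_query("INSERT INTO `luka_texter` (`Sprak`, `TEXT`) VALUES ('ru', '$text')") or die (mysql_error());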
		$query = mysql_query('SELECT TEXT FROM `luka_texter`') or die (mysql_error());
		while ($row = mysql_fetch_array($query)) {
			echo $row['TEXT'], '<br/>';
		}
	}

	### Create some characters from the Cyrillic block...
	$characters  = pack('c*', 0xD0, 0x89);
	$characters .= pack('c*', 0xD0, 0x8A);
	$characters .= pack('c*', 0xD0, 0x8B);
	$characters .= pack('c*', 0xD0, 0x8C);
	$characters .= pack('c*', 0xD0, 0x8D);
	$characters .= pack('c*', 0xD0, 0x8E);
	$characters .= pack('c*', 0xD0, 0x8F);
	### ...and put them in the form...
?>

<form name="utf8_test" method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" accept-charset="utf-8">
	<textarea name="utf8_textarea"><?php echo $characters; ?></textarea>
	<input type="submit"/>
</form>

</pre>
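
The `pack()` calls above spell out by hand the UTF-8 bytes of the Cyrillic code points U+0409 through U+040F. As a sketch of where those bytes come from, the helper below encodes a code point in the two-byte UTF-8 range. The name `utf8_two_byte` is made up for illustration, and the helper deliberately handles only U+0080 to U+07FF.

```php
<?php
// Hypothetical helper: encode one code point in the two-byte UTF-8 range
// (U+0080..U+07FF) using the bit layout 110xxxxx 10xxxxxx.
function utf8_two_byte(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Only U+0080..U+07FF is handled here');
    }
    // High five bits go into the lead byte, low six bits into the continuation byte.
    return chr(0xC0 | ($codePoint >> 6)) . chr(0x80 | ($codePoint & 0x3F));
}

// Build the same seven characters as the pack() calls in the script above.
$characters = '';
for ($cp = 0x0409; $cp <= 0x040F; $cp++) {
    $characters .= utf8_two_byte($cp);
}

echo bin2hex($characters), "\n"; // d089d08ad08bd08cd08dd08ed08f
```

Comparing `bin2hex()` output like this at each stage (form input, before the query, after the SELECT) is one way to see where the bytes change.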


#7 ctiberg

Posted 12 October 2006 - 08:32 AM

I had a rave reply typed into this text box, until I tried it out on the production server. There it has the same problems as my own attempts: it gets most of the text right, but some of it is replaced by ?'s. So I'll try to get a response out of our provider, which I suspect will prove futile. Sigh.



