
Japanese characters pulled using CURL don't display properly.


dpacmittal


Have you tried changing the content type of both the HTML and the PHP output to the Japanese content type? http://en.wikipedia.org/wiki/Japanese_language_and_computers

No, there are just a few words in Japanese... the rest are in English.

 

If you use different languages within one document, then according to the W3C specification (http://www.w3.org/TR/html401/struct/dirlang.html) you need to add a lang attribute specifying the language of the element's contents and, if necessary, a dir attribute as well.

 

More specific: http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.2
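
As a rough illustration of the lang attribute suggestion above, here is a minimal sketch of a helper that wraps a piece of text in a span carrying its language code. The function name lang_span is made up for this example, and it assumes the text is already UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): wrap text in a <span> that
// declares its language, as the HTML 4.01 lang attribute allows.
// Assumes $text is already valid UTF-8.
function lang_span(string $text, string $lang): string
{
    // Escape both the attribute value and the contents for safe HTML output.
    $safeLang = htmlspecialchars($lang, ENT_QUOTES, 'UTF-8');
    $safeText = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return '<span lang="' . $safeLang . '">' . $safeText . '</span>';
}

// Mostly English sentence with a few Japanese words marked up.
$html = 'The word for "Japanese language" is ' . lang_span('日本語', 'ja') . '.';
```

Note that the lang attribute only describes the language. It does not change the character encoding, so the page still has to be served with a charset that can represent the Japanese characters.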


Doing a simple test, grabbing the contents of http://en.wikipedia.org/wiki/Japanese_language, gives me properly encoded characters. But that might be because Wikipedia specifies the lang attribute on elements containing Japanese characters, and sets the content charset to UTF-8. I'm using this code:

 

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://en.wikipedia.org/wiki/Japanese_language');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.1) Gecko/20090624 Firefox/3.5');
$contents = curl_exec($ch);
curl_close($ch);
echo $contents;
?>

 

If that doesn't work with your source, try setting the Content-Type header that your own script sends to the browser:

 

header('Content-type: text/html; charset=utf-8');

and/or the Content-Type HTTP header sent with the cURL request (this is a request header, so it only has an effect if the remote server takes it into account):

 

curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: text/html; charset=utf-8'));
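
If the remote page is not UTF-8 to begin with, another option is to convert the fetched body yourself. The following is only a sketch: the helper name body_to_utf8 is invented for this example, it assumes the mbstring extension is available, and it trusts whatever charset the server reports in its Content-Type header.

```php
<?php
// Hypothetical helper (not from the thread): convert a fetched body to UTF-8
// based on the charset named in the response's Content-Type header.
// Assumption: mbstring is loaded and the reported charset is one it knows;
// an unknown charset name makes mb_convert_encoding() fail.
function body_to_utf8(string $body, $contentType): string
{
    $charset = 'UTF-8'; // fallback when no charset is reported

    if (is_string($contentType)
        && preg_match('/charset=["\']?([\w.:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }

    // Nothing to do if the server already sends UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body;
    }

    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Possible usage inside the cURL script, before curl_close($ch):
//   $type     = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//   $contents = body_to_utf8($contents, $type);
```

Pages that only declare their charset in a meta tag, and not in the HTTP header, are not handled by this sketch.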

Thanks, that was helpful. Half my problem is solved: the characters display fine when I run my script. The other part of my script posts the content to WordPress using XML-RPC.

 

http://www.timepass247.com/mylife/

 

Check how the characters are displayed there. They are fine in the script, which means the problem occurs when posting through XML-RPC.

I used this to encode the request as UTF-8:

$request = xmlrpc_encode_request('metaWeblog.newPost',$params, Array('encoding'=>'utf-8'));

Am I wrong somewhere? This encoding thing really baffles me.
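
One thing that may be worth trying, offered as a guess and not a confirmed fix: the xmlrpc extension's output options also include an 'escaping' setting, which by default escapes non-ASCII characters in addition to markup. Restricting it to 'markup' should leave the UTF-8 bytes untouched. In the sketch below, the helper name utf8_xmlrpc_options and all the values in $params are placeholders, and the call is only made if the xmlrpc extension is actually loaded.

```php
<?php
// Hypothetical helper (not from the thread): output options intended to keep
// UTF-8 text as-is when building an XML-RPC request.
function utf8_xmlrpc_options(): array
{
    return [
        'encoding' => 'utf-8',
        // Escape only XML markup; do not turn non-ASCII characters into entities.
        'escaping' => 'markup',
    ];
}

// Placeholder parameters for metaWeblog.newPost: blog id, user, password,
// post struct, publish flag. All values here are made up.
$params = [
    1,
    'username',
    'password',
    ['title' => '日本語のタイトル', 'description' => 'English text with 日本語 inside.'],
    true,
];

// Assumption: the xmlrpc extension is installed (it is not bundled with PHP 8+).
if (function_exists('xmlrpc_encode_request')) {
    $request = xmlrpc_encode_request('metaWeblog.newPost', $params, utf8_xmlrpc_options());
}
```

If the characters still arrive garbled, the charset of the HTTP request that carries $request and the WordPress database charset would be the next places to look.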
