phpsycho Posted August 18, 2011

I came across some Chinese or Japanese characters in my db. They didn't display correctly, though. What function do I need to use so the characters come out correctly? I don't need to translate them, just display the text correctly. For example, my db has:

    ¥Ü¡¼¥«¥í¥¤¥É¤Î²Î»ìÃÖ¾ì

but the text should actually look like this:

    ボーカロイドの歌詞置場

Link to comment: https://forums.phpfreaks.com/topic/245156-char-encoding/

requinix Posted August 18, 2011

Japanese. I don't think Chinese people are quite as obsessed with Vocaloid. That particular string is in EUC-JP encoding. The most common international encoding (for handling as many characters from as many alphabets as possible) is probably UTF-8. I suggest you try to use that for everything.

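To make the EUC-JP diagnosis concrete, here is a minimal sketch that reproduces the garbled string from the first post. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the variable names are illustrative only.

```php
<?php
// The readable title from the first post, as a UTF-8 string literal
// (assumption: this file is saved as UTF-8).
$utf8 = 'ボーカロイドの歌詞置場';

// The same text as EUC-JP bytes, which is how the scraped site encodes it.
$eucJp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Misreading those bytes as ISO-8859-1 reproduces the garbage seen in the db.
$garbled = mb_convert_encoding($eucJp, 'UTF-8', 'ISO-8859-1');

// Declaring the real source encoding recovers the original text.
$fixed = mb_convert_encoding($eucJp, 'UTF-8', 'EUC-JP');

echo $garbled, PHP_EOL;
echo $fixed, PHP_EOL;
```

The bytes never change between the two conversions; only the declared source encoding does, which is why the conversion has to name EUC-JP explicitly.
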
phpsycho Posted August 18, 2011 Author

Yeah, I have the field in the MySQL table set to utf-8, but it still doesn't display right.

xyph Posted August 19, 2011

Make sure your PHP page is encoded in UTF-8 and your HTML head contains

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

phpsycho Posted August 19, 2011 Author

It's actually a scraper, so it's scraping data from websites: their title, description, keywords, etc. I picked up a couple of Japanese sites, and they have that meta tag, just with a different charset. This is the code I have to convert the title of the site:

    $charset = 'None';
    $description='';
    $keywords='';
    preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers);
    if(count($headers) > 0) {
        if(preg_match("/<meta[^>]*http-equiv[^>]*charset=(.*)(\"|')>/Ui",$headers[1], $results)){
            $charset= $results[1];
        } else {
            $charset='None';
        }
    } else {
        $ok=0;
        //echo 'No HEAD - Might be malformed or be a feed<br />';
    }
    if($charset != 'None'){
        $title = mb_convert_encoding($title, "UTF-8", $charset);
    }
    if($title == null){
        $title = $url;
    }

Shouldn't using mb_convert_encoding like that fix the problem?

xyph Posted August 19, 2011

It might. I've never messed with converting encodings; I always stick with UTF-8. I know that the manual for that function has tons of user comments about odd behavior, though. Give it a shot, and let us know.

phpsycho Posted August 19, 2011 Author

Nope, it still won't work.

requinix Posted August 19, 2011

Have you converted the string from EUC-JP? You have to do that. It doesn't just magically change to the right encoding if it started off in the wrong one.

phpsycho Posted August 19, 2011 Author

That's what this piece of code does:

    if(preg_match("/<meta[^>]*http-equiv[^>]*charset=(.*)(\"|')>/Ui",$headers[1], $results)){
        $charset= $results[1];
    } else {
        $charset='None';
    }
    if($charset != 'None'){
        $title = mb_convert_encoding($title, "UTF-8", $charset);
    }

requinix Posted August 19, 2011

That should be $headers[0]. Also make sure their webpages are reporting the right character encoding, which there's a good chance they are.

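One thing worth checking in the pattern quoted above: `charset=(.*)(\"|')>` only matches when the closing quote is directly followed by `>`, so a tag written as `... charset=euc-jp" />` would fall through to 'None' and the conversion would be skipped. Whether that is what happened on this particular site is not confirmed in the thread. The following is a sketch of a more tolerant lookup; `declaredCharset` is a hypothetical helper name, not something from the original code.

```php
<?php
/**
 * Hypothetical helper: return the charset declared in a <meta> tag,
 * upper-cased, or null when no declaration is found.
 */
function declaredCharset(string $html): ?string
{
    // Accepts <meta http-equiv=... content="text/html; charset=x"> as well as
    // <meta charset="x">, with or without quotes or a self-closing slash.
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// A self-closing tag, which the stricter pattern would miss.
var_dump(declaredCharset('<meta http-equiv="Content-Type" content="text/html; charset=euc-jp" />'));
```

Returning null instead of the string 'None' keeps "no declaration found" distinct from any real charset name.
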
phpsycho Posted August 19, 2011 Author

I believe [0] would be the whole tag, not just the charset. I even wrote a small bit of code to test, and it still doesn't work:

    $url='http://010701070107.blog5.fc2.com/';
    $userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    $buffer = curl_exec($ch);
    $curl_info = curl_getinfo($ch);
    curl_close($ch);
    $header_size = $curl_info['header_size'];
    $header = substr($buffer, 0, $header_size);
    preg_match("~charset=([^\s]*)\s~is", $header, $header);
    $header = mb_convert_encoding("Japanese text", "UTF-8", $header[1]);
    echo $header;

That just spits out this:

    ???若?????ゃ?????�臀??

xyph Posted August 19, 2011

Here's how I echoed your site in proper UTF-8:

    <?php
    $url='http://010701070107.blog5.fc2.com/';
    $userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
    //curl_setopt($ch, CURLOPT_HEADER, true);
    //curl_setopt($ch, CURLOPT_NOBODY, true);
    $buffer = curl_exec($ch);
    $curl_info = curl_getinfo($ch);
    curl_close($ch);
    $expr = '%text/html; charset=euc-jp%';
    $buffer = preg_replace( $expr, 'text/html; charset=UTF-8', $buffer );
    $buffer = mb_convert_encoding($buffer,'UTF-8','EUC-JP');
    echo $buffer;
    ?>

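The example above hard-codes EUC-JP. A sketch of the same two steps (convert the bytes, rewrite the declaration) with the charset passed in might look like this. `pageToUtf8` is a hypothetical name, and the function assumes the source encoding is ASCII-compatible, as EUC-JP is, so the declaration can still be found after conversion.

```php
<?php
/**
 * Hypothetical helper: convert a fetched page to UTF-8 and update its
 * charset declaration. $charset is whatever the page declared, e.g. "euc-jp".
 */
function pageToUtf8(string $html, string $charset): string
{
    // Unknown encoding names raise a ValueError on PHP 8 and a warning plus
    // a false return value on PHP 7; treat both as "leave the page untouched".
    try {
        $converted = @mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        $converted = false;
    }
    if (!is_string($converted)) {
        return $html;
    }

    // Rewrite the first declaration so the markup matches the new bytes.
    $pattern = '/(charset\s*=\s*["\']?)' . preg_quote($charset, '/') . '/i';
    $result = preg_replace($pattern, '${1}UTF-8', $converted, 1);

    return $result ?? $converted;
}
```

With the variables from the post above, the call would be `$buffer = pageToUtf8($buffer, 'euc-jp');`, run before any title or keyword extraction so that everything pulled out of the page is already UTF-8.
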
phpsycho Posted August 19, 2011 Author

Perfect. The only problem now is: what happens if there's a site without that meta tag that has Japanese text, or text in some other language? How could I fix it to read that text?

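For pages with no meta tag, one possible fallback order is: the charset in the HTTP Content-Type header, then a UTF-8 validity check, then a guess from `mb_detect_encoding`. This is only a sketch: `guessCharset` is a hypothetical helper, the candidate list is an assumption that would need tuning for the sites being scraped, and byte-level guessing between encodings such as EUC-JP and Shift_JIS is not reliable for short strings.

```php
<?php
/**
 * Hypothetical helper: best-effort charset guess for a page that does not
 * declare one in its markup.
 */
function guessCharset(string $bytes, ?string $httpContentType = null): string
{
    // 1. Prefer a charset sent in the HTTP Content-Type header, if any.
    if ($httpContentType !== null
        && preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $httpContentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Bytes that validate as UTF-8 are treated as UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    // 3. Last resort: let mbstring pick from an assumed candidate list.
    $guess = mb_detect_encoding($bytes, ['EUC-JP', 'SJIS', 'ISO-8859-1'], true);

    return $guess !== false ? $guess : 'ISO-8859-1';
}
```

With cURL, the header value for the first argument can be read with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` before the handle is closed.
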
phpsycho Posted August 19, 2011 Author

Sigh, still having issues with this.

    $html = $response;
    $url = $info['url'];
    $erurl = $info['url'];
    $code = $info['http_code'];
    $domain = parseHOST($url);
    $url = str_ireplace("www.","",$url);
    $url = rtrim($url, "/");
    if($html && $code >= 200 && $code < 300){
        $charset = 'None';
        $description='';
        $keywords='';
        preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers);
        if(preg_match("/<meta[^>]*http-equiv[^>]*charset=([^\"']*)>/Ui",$headers[1], $results)){
            $expr = '%charset=([^"\']*)%';
            $html = preg_replace( $results[1], 'UTF-8', $results[0]);
            $html = mb_convert_encoding($html,'UTF-8',$results[1]);
        }
        if(preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers)){
            preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers);
        }

Sorry, I'm kinda new to the whole charset thing. It's still coming out with ¥Ü¡¼¥«¥í¥¤¥É¤Î²Î»ìÃÖ¾

xyph Posted August 19, 2011

What exactly are you trying to scrape? Did my example echo a valid page to the browser?

phpsycho Posted August 19, 2011 Author

I'm scraping a bunch of sites, but this one in particular is http://010701070107.blog5.fc2.com/

I am scraping it for title, description, and keywords. So I want to pull the charset and convert the cURL response first, then pull the important data from the converted response. And yes, yours worked. Now I'm just trying to integrate it into my script, but I'm having a hard time doing so.

phpsycho Posted August 20, 2011 Author

Okay, it's not the converting of the data that's the problem. It works, but in order for it to work I have to echo the HTML from the website after echoing the title, description, and keywords. If I don't do that, it won't display the converted title, description, and keywords correctly. Any clue why that's happening?

EDIT: Uhh, okay, it works when it wants to, and when it does, the HTML needs to be echoed for it to work. lol wtf

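One possible explanation, not confirmed anywhere in the thread: the echoed site HTML carries the rewritten `charset=UTF-8` meta tag, so the browser decodes the whole response as UTF-8. Without it, the output declares no encoding at all and the browser has to guess, which would fit "it works when it wants to". If that is the cause, declaring UTF-8 on the scraper's own output should make the echoed HTML unnecessary. A minimal sketch, with `renderFields` as a hypothetical helper:

```php
<?php
/**
 * Hypothetical helper: wrap scraped fields in a page that declares UTF-8.
 * Assumes the values have already been converted to UTF-8.
 */
function renderFields(array $fields): string
{
    $rows = '';
    foreach ($fields as $label => $value) {
        $rows .= '<p><b>' . htmlspecialchars((string) $label, ENT_QUOTES, 'UTF-8') . ':</b> '
               . htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8') . "</p>\n";
    }

    return "<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head><body>\n"
         . $rows . "</body></html>\n";
}

// Declare the encoding in the HTTP response too, while headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo renderFields(['title' => 'ボーカロイドの歌詞置場']);
```

The `header()` call has to run before any output is echoed, which is why it is guarded with `headers_sent()`.
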