
utf and csv question


NotionCommotion


Major breakthrough.  utf8_encode() allows me to view utf-8 characters in the browser.  Therefore, my source data file must be ISO-8859-1 encoded text, right?   Am I understanding this correctly?


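<?php
//mb_internal_encoding("UTF-8");
header('Content-Type: text/html; charset=utf-8');
$file = fopen('some_csv_file_created_by_excel.csv', 'r');
if ($file === false) {
    exit('Could not open the CSV file.');
}
ob_start();
// Print the first column twice: the raw bytes, then the bytes transcoded from ISO-8859-1 to UTF-8.
while (($spec = fgetcsv($file, 100000, ',')) !== false) {
    echo $spec[0] . ' ' . utf8_encode($spec[0]) . '<br>';
}
fclose($file);
$string = ob_get_clean();
?>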
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>utf</title>
    </head>
    <body>
        <p><?php echo($string);?></p>
    </body>
</html>

utf8_encode() is poorly named. It actually transcodes data from one encoding (ISO-8859-1) to another (UTF-8). So it only makes sense if your source data has the “wrong” encoding and can only be fixed at runtime.

 

If the source data is already encoded as UTF-8, or if there's any way you can convert it to UTF-8 ahead of time, the function is not necessary. Transcoding data at runtime is inefficient, so it should be avoided whenever possible.
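
As a rough illustration of that advice, here is a minimal sketch that transcodes only when a value is not already valid UTF-8. It assumes the mbstring extension is loaded; the helper name `to_utf8` is hypothetical, and treating every non-UTF-8 value as ISO-8859-1 is an assumption that may not hold for every Excel export.

```php
<?php
// Hypothetical helper: return $value as UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true)
var_dump(to_utf8($utf8) === $utf8);   // bool(true)
```

The validity check is only a heuristic: a short ISO-8859-1 string can happen to form valid UTF-8 and would then be passed through unchanged, so knowing the real source encoding is still the safer route.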

Thanks Jacques,

 

Yes, I had recently become aware of the poor name of utf8_encode().

 

As far as I can tell, Excel cannot export text with UTF-8 encoding.  It is possible to take several steps to do so (export to Google equivalent, etc), but that is not ideal.  Excel can export to Unicode text and since I have only one column, this might work, but it mysteriously puts quotes around some of the entries.  Obviously, not a PHP topic and no need to respond unless you want to.

 

My main reason for the original post was to make sure I understood what I was seeing.  If, without utf8_encode(), the browser displays � for non-ASCII characters, then the source file was ISO-8859-1 (or at least not UTF-8)?

Yes, those symbols mean “not a valid character”. UTF-8 uses a specific byte pattern for non-ASCII characters, and most ISO-8859-1-encoded characters outside ASCII don't conform to that pattern, so you get no character at all, not just a wrong character.
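
A small sketch of this at the byte level, assuming the mbstring extension is available. The variable names are only for illustration.

```php
<?php
$latin1 = "\xE9";     // "é" as a single ISO-8859-1 byte
$utf8   = "\xC3\xA9"; // "é" as two UTF-8 bytes

// 0xE9 on its own announces a three-byte UTF-8 sequence that never arrives,
// so the byte is not valid UTF-8 and a browser shows the replacement character.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```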

 

If you try it the other way round (UTF-8 misinterpreted as ISO-8859-1), you'll see cryptic characters instead, because every UTF-8 byte sequence is formally valid ISO-8859-1.
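
The reverse case can be sketched like this, again assuming mbstring. Each byte of the UTF-8 sequence is read as its own ISO-8859-1 character, which is also what happens when utf8_encode() is applied to data that is already UTF-8.

```php
<?php
$utf8 = "\xC3\xA9"; // "é" in UTF-8

// Treat the two bytes as two ISO-8859-1 characters and re-encode them as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump($misread === "\xC3\x83\xC2\xA9"); // bool(true)
echo $misread, "\n";                       // prints "Ã©"
```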
