utf and csv question


NotionCommotion

Major breakthrough. With utf8_encode(), the non-ASCII characters now display correctly in the browser. Therefore, my source data file must be ISO-8859-1 encoded text, right? Am I understanding this correctly?


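<?php
//mb_internal_encoding("UTF-8");
header('Content-Type: text/html; charset=utf-8');
$file = fopen('some_csv_file_created_by_excel.csv', 'r');
if ($file === false) {
    exit('Could not open the CSV file.');
}
ob_start();
// Print each first-column value twice: raw, then transcoded from ISO-8859-1 to UTF-8.
while (($spec = fgetcsv($file, 100000, ",")) !== false) {
    echo($spec[0].' '.utf8_encode($spec[0]).'<br>');
}
fclose($file);
$string = ob_get_clean();
?>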
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>utf</title>
    </head>
    <body>
        <p><?php echo($string);?></p>
    </body>
</html>
Edited by NotionCommotion

utf8_encode() is poorly named. It actually transcodes data from one encoding (ISO-8859-1) to another (UTF-8). So it only makes sense if your source data has the “wrong” encoding and can only be fixed at runtime.

If the source data is already encoded as UTF-8, or if there's any chance you can convert it to UTF-8 up front, the function is not necessary, and applying it anyway would garble the non-ASCII characters. Transcoding the same data on every request is wasted work, so it should be avoided whenever possible.
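To make the transcoding concrete, here is a minimal sketch of what happens at the byte level. It assumes the iconv extension is available, and the sample string is made up for illustration rather than taken from the CSV file in the question.

```php
<?php
// "é" in ISO-8859-1 is the single byte 0xE9 (sample data, not from the thread's CSV).
$latin1 = "caf\xE9";

// Transcode ISO-8859-1 -> UTF-8, naming both encodings explicitly.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // 636166e9
echo bin2hex($utf8), "\n";   // 636166c3a9
```

The single byte 0xE9 becomes the two-byte sequence 0xC3 0xA9, which is the same conversion utf8_encode() performs. Since utf8_encode() is deprecated as of PHP 8.2, a call that names the source encoding (iconv() here, or mb_convert_encoding() if mbstring is installed) is probably the better habit anyway.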

Edited by Jacques1

Thanks Jacques,

Yes, I had recently become aware of the poor name of utf8_encode().

As far as I can tell, Excel cannot export text with UTF-8 encoding. It is possible to get there in several steps (exporting via the Google equivalent, etc.), but that is not ideal. Excel can export to Unicode text, and since I have only one column, this might work, but it mysteriously puts quotes around some of the entries. Obviously, this is not a PHP topic, so no need to respond unless you want to.

My main reason for the original post was to make sure I understood what I was witnessing. If, without utf8_encode(), the browser displays � for non-ASCII characters, then the source file was ISO-8859-1 (or at least not UTF-8)?


Yes, those symbols mean “not a valid character”. UTF-8 uses a specific byte pattern for everything outside ASCII, and the ISO-8859-1 bytes for non-ASCII characters usually don't comply with that pattern, so you get no character at all, not just a wrong character.

If you try it the other way round (UTF-8 misinterpreted as ISO-8859-1), you'll see cryptic characters instead, because any byte sequence, including UTF-8, is formally valid ISO-8859-1.
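Both directions can be reproduced in a few lines. This is only a sketch: is_valid_utf8() is a hypothetical helper written for this example (it relies on PCRE's /u modifier rejecting invalid UTF-8), and iconv is assumed to be available.

```php
<?php
// Hypothetical helper: with the /u modifier, PCRE refuses subjects that are not valid UTF-8,
// so preg_match() returns false instead of 1.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "\xE9";     // "é" as ISO-8859-1: one byte
$utf8   = "\xC3\xA9"; // "é" as UTF-8: lead byte plus continuation byte

// ISO-8859-1 read as UTF-8: 0xE9 announces a three-byte sequence that never
// arrives, so the data is invalid and a browser shows the replacement character.
var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)

// UTF-8 read as ISO-8859-1: every byte is accepted as a character of its own.
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo $mojibake, "\n"; // Ã©
```

A check like this on a sample of the CSV can suggest whether the file is already UTF-8 before deciding to transcode it, although passing the check does not prove the encoding, since pure ASCII text is valid in both.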

