Preserving double quotes in an uploaded text file


Marchand

This application uploads and processes a text file generated by a datalogger. The raw data file has content like:

"Plot Title: Mill - Logsill"

"#","Date Time, GMT-07:00","Temp, ¡C","Coupler Detached","Coupler Attached","Stopped","End Of File"

1,01/28/10 01:30:00 PM,2.557,Logged,,,

2,01/28/10 02:00:00 PM,2.664,,,,

 

The processing utility ignores the first two lines and starts processing at the "1,01/28/10..." line. The core code works fine, assuming the data logger file is magically already on the server. However, I'm finding that when I upload the file and use move_uploaded_file() with the above data, the resulting file looks like:

Plot Title: Mill - Logsill

 

#,"Date Time, GMT-07:00","Temp, ÁC",Coupler Detached,Coupler Attached,Stopped,End Of File

 

1,1/28/10 13:30,2.557,Logged,,,

 

2,1/28/10 14:00,2.664,,,,

 

which is highly undesirable.  What I don't know -- and I hope you do -- is how to preserve the exact contents of the original file during the upload process, in particular the double quotes. (The extra '0A' characters are under control). Seems simple in theory, but I'm utterly stumped since the parsing/stripping seems to happen deep within the move_uploaded_file() function itself. So what am I missing here?

 

P.S.

FWIW: The default logger output file has a 'csv' extension but it could be renamed easily enough by the user prior to upload.

 

 


That's very unusual.  move_uploaded_file() works even on images and zip files, so it normally doesn't do any processing.  Have you tried a script which does move_uploaded_file() only and nothing else, and checked if that alone is removing the quotes?
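
A bare-bones test page along those lines might look like the sketch below. The form field name "logfile", the uploads/ directory and the moveAndVerify() helper are all made up for illustration. The helper takes a checksum of PHP's temporary copy, moves the file, and takes a checksum of the result, so any change made by the move itself would show up as a mismatch.

```php
<?php
// Hypothetical stand-alone test page: upload, move, compare checksums.
// The field name "logfile" and the uploads/ directory are assumptions.

/**
 * Move a file and report whether its bytes changed during the move.
 * $mover is injectable so the comparison can be tried without an HTTP upload;
 * by default it is PHP's own move_uploaded_file().
 */
function moveAndVerify(string $tmpName, string $dest, ?callable $mover = null): array
{
    $mover  = $mover ?? 'move_uploaded_file';
    $before = hash_file('sha256', $tmpName);          // checksum of the temp copy
    $moved  = (bool) $mover($tmpName, $dest);
    $after  = $moved ? hash_file('sha256', $dest) : null;

    return [
        'moved'     => $moved,
        'before'    => $before,
        'after'     => $after,
        'identical' => $moved && $before === $after,
    ];
}

if (isset($_FILES['logfile']) && $_FILES['logfile']['error'] === UPLOAD_ERR_OK) {
    // basename() keeps the client-supplied name from escaping the target directory
    $dest = __DIR__ . '/uploads/' . basename($_FILES['logfile']['name']);
    header('Content-Type: text/plain');
    print_r(moveAndVerify($_FILES['logfile']['tmp_name'], $dest));
} elseif (PHP_SAPI !== 'cli') {
    // No upload yet: show a minimal form
    echo '<form method="post" enctype="multipart/form-data">'
       . '<input type="file" name="logfile"> <input type="submit" value="Upload">'
       . '</form>';
}
```

If 'identical' comes back true, PHP delivered the bytes unchanged, and whatever rewrote the file did so either before the browser sent it or after this script finished.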


Yeah, there are more differences than just the double quotes. The leading zero on the date is dropped and the times are converted from AM/PM to military time. There is absolutely no way that the move_uploaded_file() function is doing this. There must be some other processing code that is making the changes.


Thank you for your replies -- they're appreciated.

mjdamato: There are definitely more problems than the double quotes. I picked that issue because it was most visually obvious, and in hopes that the solution to that issue would solve the other anomalies.

 

btherl: This is an area I've looked at up, down and sideways. Basically, other than standard checks about file size, MIME type, etc., there's no pre-processing of the selected file prior to calling move_uploaded_file().

While the full HTML/PHP code block would probably be too large to usefully post, it all comes down to the following function. Does anything strike you here as a possible culprit?

 

= M =

///////// code start /////////

protected function processFile($filename, $error, $size, $type, $tmp_name, $overwrite)
{
    // Validate the upload status, size and type before touching the file
    $OK = $this->checkError($filename, $error);
    if ($OK) {
        $sizeOK = $this->checkSize($filename, $size);
        $typeOK = $this->checkType($filename, $type);
        if ($sizeOK && $typeOK) {
            // checkName() may return a new name if the file already exists
            $name = $this->checkName($filename, $overwrite);
            $success = move_uploaded_file($tmp_name, $this->_destination . $name);

            if ($success)
            {
                $this->_destFileName = $this->_destination . $name;
                $message = "The selected file <u> $filename</u> has been uploaded successfully.";
                if ($this->_renamed) {
                    $message .= " <br>(NOTE: An existing file with that name was already there, so the uploaded file has been renamed: <u>$name</u>)";
                }
                $this->_messages[] = $message;
            }
            else
            {
                $this->_messages[] = "Could not upload $filename";
            }
        }
    }
}

///////// code end /////////


This is within a class. You are calling several other functions against the file ($this->checkSize, $this->checkType, $this->checkName). Based upon the names, I doubt those functions should be modifying the file. Plus, since you are passing a reference to the file to this function, there is no guarantee that the file is not modified before the function is called or afterward.

 

Basically, there is nothing in that code - specifically - that would modify the file. It would either be taking place before the function is called, within one of the functions that are called in that function, or after the function is called.


I completely agree. Yet nothing else in the class (or other PHP code) touches that file other than move_uploaded_file(), nor does anything on the server touch it after PHP has written it. (The "check()" functions just validate file size, MIME type, etc. and don't touch the file itself.) And still the content of the uploaded file is different.

 

Just to explicitly confirm: Am I right in assuming that move_uploaded_file() does not change -- or at least should not change -- any content in the file regardless of mime type, file content, etc? Does it "care" about Unicode?

 

= M =


Just to explicitly confirm: Am I right in assuming that move_uploaded_file() does not change -- or at least should not change -- any content in the file regardless of mime type, file content, etc? Does it "care" about Unicode?

Not to my knowledge. But even if it did, there is absolutely no way that the move_uploaded_file() function would know how to convert AM/PM time to 24-hour time. That's just ridiculous.

 

In any event you can test it yourself quite simply. Just do what btherl stated and create a test page that does nothing but upload a file without all the other code you have.


How are you looking at or displaying the output that seems to be modified from the original? It's more likely that you have a framework/CMS or program editor that is doing this when it is displayed; some cron job is parsing through every new uploaded file it finds in a folder; someone is pranking you; or you have a previously uploaded file that was in the 'incorrect' format and your newly uploaded file is failing or is failing to overwrite the 'incorrect' file for some reason.

 

This is the first post I have ever seen where it has been suggested that move_uploaded_file isn't just calling operating system commands to move the file through the operating system to the destination. However, I have seen posts where someone had built a custom framework and the output he thought was coming from PHP was actually produced by his own code pre-processing incoming form data. I have even seen a case in the SMF forum software on phpfreaks where someone posted a series of names/symbols and the number 8, in a series 6,7,8,9, was altered to be 'eight'.

 

So, somewhere in the processing/display of this data, there is some code somewhere  (I hope you are not downloading/outputting this into Excel or something) that is responsible for the symptom.
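
One way to take every viewer and editor out of the picture is to compare the two files byte for byte on the server itself. The sketch below is only an illustration: firstDifference() and hexAt() are made-up helper names, and it assumes both files are small enough to read into memory.

```php
<?php
/**
 * Return the byte offset of the first difference between two files,
 * or null if they are byte-for-byte identical.
 */
function firstDifference(string $pathA, string $pathB): ?int
{
    $a = file_get_contents($pathA);
    $b = file_get_contents($pathB);
    if ($a === false || $b === false) {
        throw new RuntimeException('Could not read one of the files');
    }
    if ($a === $b) {
        return null;
    }
    $limit = min(strlen($a), strlen($b));
    for ($i = 0; $i < $limit; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    return $limit; // one file is a prefix of the other
}

/** Hex string of up to $length bytes starting at $offset, for a quick look. */
function hexAt(string $path, int $offset, int $length = 16): string
{
    $chunk = file_get_contents($path, false, null, $offset, $length);
    return bin2hex($chunk === false ? '' : $chunk);
}
```

Run against the file as it sits on the client disk (copied over by hand) and the file the upload script wrote, a null result means the upload path is clean and the difference is being introduced somewhere else.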


No frameworks, absolutely nothing special here. The Web server running the PHP is a Mac Mini, OS X 10.6, PHP 5.3.6. I'm just prototyping an app for users to upload output files from data loggers and then stuff the data into SQL Server. All the code is hand-coded HTML/PHP until I get the basic functionality down -- and there's not much of it at this point. I'm reviewing/comparing the original file and the uploaded file on a PC using the HxD hex editor (version 1.7.7.0). It's all happening on an intranet with 4 nodes attached -- a development environment. The only quirk I can imagine is that I'm seeing this on a machine using a Cisco VPN to connect in. Yet I can't believe that Cisco is a factor here. A simple "drag and drop" file transfer between the systems, over the VPN, doesn't corrupt the file, FWIW. The Mac Mini has no cron jobs or other system utilities which routinely touch files there. The download of the uploaded file happens moments after it's uploaded.

 

It's as clean and simple a test environment as I can imagine. The production environment will undoubtedly be more complex, but it may never make it there if I can't sort this out here.

 

= M =

 

 


Another thought - how is the original csv being produced or viewed? Just opening and saving a file in Excel without actually making any changes can cause the contents of the file to be altered, because what you see in a program like Excel for display isn't necessarily what the csv data will be (i.e. date/time fields can be displayed as anything, and quotes will only be around text fields that have data that contains separator characters).


I was going to say - I only see this kind of behavior when I open a CSV file in Excel and save it.

 

Regardless, this isn't REALLY an issue. Just use a CSV parser like the one below that I've made. Your original file and your modified version should both parse into the same rows and columns; only the text of the date/time field differs.

 

/**
* 
* Convert a multi-line CSV string into a 2d array. Follows RFC 4180, allows
* "cells with ""escaped delimiters""" and multi-line enclosed cells.
* It assumes the CSV file is properly formatted, and doesn't check for errors
* in CSV format.
* @param string $str The CSV string
* @param string $d The delimiter between values
* @param string $e The enclosing character
* @param bool $crlf Set to true if your CSV file should return carriage return
*                   and line feed (CRLF should be returned according to RFC 4180)
* @return array|false The parsed rows, or FALSE if the regex fails
*/
function csv_explode( $str, $d=',', $e='"', $crlf=TRUE ) {
   // Convert CRLF to LF, easier to work with in regex
   if( $crlf ) $str = str_replace("\r\n","\n",$str);
   // Get rid of trailing linebreaks that RFC4180 allows
   $str = trim($str);
   // Do the dirty work
   if ( preg_match_all(
      '/(?:
         '.$e.'((?:[^'.$e.']|'.$e.$e.')*+)'.$e.'(?:'.$d.'|\n|$)
            # match enclose, then match either non-enclose or double-enclose
            # zero to infinity times (possessive), then match another enclose,
            # followed by a comma, linebreak, or string end
         |   ####### OR #######
         ([^'.$d.'\n]*+)(?:['.$d.'\n]|$)
            # match anything that is not a comma or linebreak zero to infinity
            # times (possessive), then match either a comma or a linebreak or
            # string end
      )/x', 
      $str, $ms, PREG_SET_ORDER
   ) === FALSE ) return FALSE;
   // Initialize vars, $r will hold our return data, $i will track which line we're on
   $r = array(); $i = 0;
   // Loop through results
   foreach( $ms as $m ) {
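      // If the first group is an empty string, the cell was unquoted (or empty).
      // A strict comparison is used so that a quoted "0" is not treated as empty.
      if( !isset($m[1]) || $m[1] === '' )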
         // Put the CRLF back in if needed
         $r[$i][] = isset($m[2]) ? (($crlf == TRUE) ? str_replace("\n","\r\n",$m[2]) : $m[2]) : '';
      else {
         // The cell was quoted, so we want to convert any "" back to " and
         // any LF back to CRLF, if needed
         $r[$i][] = ($crlf == TRUE) ?
            str_replace(
               array("\n",$e.$e),
               array("\r\n",$e),
               $m[1]) :
            str_replace($e.$e, $e, $m[1]);
      }
      // If the raw match doesn't have a delimiter, it must be the last in the
      // row, so we increment our line count.
      if( substr($m[0],-1) != $d )
         $i++;
   }
   return $r;

}

?>

 

 

This sample

$str1 = '"Plot Title: Mill - Logsill"
"#","Date Time, GMT-07:00","Temp, AC","Coupler Detached","Coupler Attached","Stopped","End Of File"
1,01/28/10 01:30:00 PM,2.557,Logged,,,
2,01/28/10 02:00:00 PM,2.664,,,,';

$str2 = 'Plot Title: Mill - Logsill' ."\r\n".
'#,"Date Time, GMT-07:00","Temp, AC",Coupler Detached,Coupler Attached,Stopped,End Of File' ."\r\n".
'1,1/28/10 13:30,2.557,Logged,,,' ."\r\n".
'2,1/28/10 14:00,2.664,,,,';

print_r( csv_explode($str1) );

print_r( csv_explode($str2) );

 

 

Outputs this result

Array
(
    [0] => Array
        (
            [0] => Plot Title: Mill - Logsill
        )

    [1] => Array
        (
            [0] => #
            [1] => Date Time, GMT-07:00
            [2] => Temp, AC
            [3] => Coupler Detached
            [4] => Coupler Attached
            [5] => Stopped
            [6] => End Of File
        )

    [2] => Array
        (
            [0] => 1
            [1] => 01/28/10 01:30:00 PM
            [2] => 2.557
            [3] => Logged
            [4] => 
            [5] => 
            [6] => 
        )

    [3] => Array
        (
            [0] => 2
            [1] => 01/28/10 02:00:00 PM
            [2] => 2.664
            [3] => 
            [4] => 
            [5] => 
            [6] => 
        )

)
Array
(
    [0] => Array
        (
            [0] => Plot Title: Mill - Logsill
        )

    [1] => Array
        (
            [0] => #
            [1] => Date Time, GMT-07:00
            [2] => Temp, AC
            [3] => Coupler Detached
            [4] => Coupler Attached
            [5] => Stopped
            [6] => End Of File
        )

    [2] => Array
        (
            [0] => 1
            [1] => 1/28/10 13:30
            [2] => 2.557
            [3] => Logged
            [4] => 
            [5] => 
            [6] => 
        )

    [3] => Array
        (
            [0] => 2
            [1] => 1/28/10 14:00
            [2] => 2.664
            [3] => 
            [4] => 
            [5] => 
            [6] => 
        )

)
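
The one column that still differs between the two results is the date/time text. Below is a rough sketch of how both spellings could be brought to one canonical form before the insert into the database. normalizeLoggerTimestamp() is a made-up name, and the two format strings are simply the two variants seen in this thread, so a different logger may need others.

```php
<?php
/**
 * Convert either timestamp spelling seen in the thread to "Y-m-d H:i:s".
 * Returns null when the value matches neither known format.
 */
function normalizeLoggerTimestamp(string $raw): ?string
{
    // '!' resets all fields that the format does not parse (e.g. seconds) to zero
    $formats = [
        '!m/d/y h:i:s A',   // 01/28/10 01:30:00 PM  (as written by the logger)
        '!n/j/y G:i',       // 1/28/10 13:30         (as rewritten by a spreadsheet)
    ];
    foreach ($formats as $format) {
        $dt = DateTime::createFromFormat($format, trim($raw));
        if ($dt !== false) {
            return $dt->format('Y-m-d H:i:s');
        }
    }
    return null; // unknown format: let the caller decide what to do
}

var_dump(normalizeLoggerTimestamp('01/28/10 01:30:00 PM')); // string(19) "2010-01-28 13:30:00"
var_dump(normalizeLoggerTimestamp('1/28/10 13:30'));        // string(19) "2010-01-28 13:30:00"
```

Note that the rewritten file has dropped the seconds, so they can only be assumed to be zero.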


Thanks for your continued interest here. The file is never touched by Excel -- or anything else other than this app. It's exported from a data logger onto disk and then (ideally) uploaded from the local disk to the Web server and then the data pushed into SQL Server. I started looking at things in hex editors when things started blowing up. The original test of pushing things from the logger file to SQL Server was done when the file was simply copied onto the server in the standard "drag and drop" sort of way. And that worked fine. The problem started when I started using PHP and move_uploaded_file to get the data from the local disk to the server for the push into SQL Server. And while I can't imagine that something as standard as move_uploaded_file() is the culprit, it's the only new element in the loop. Makes me appreciate the White Queen in Through the Looking-Glass believing six impossible things before breakfast.


xyph: Thanks for both your thoughts and your code. I'll play with it tomorrow -- after my head has recovered a bit. I'm seeing the same behavior using both Safari on a Mac and Firefox on a PC -- making me yet more suspicious of the PHP code. I'll do more testing tomorrow though.

 

= M =


Try the code on another server. Set one up at home.

 

This is NOT normal behavior. Something outside of PHP is doing this, as I've done a similar Upload CSV->Parse->Insert to Database script before, and never had issues like you're talking about.

 

The only time I see that kind of change in a CSV file is when I've used a program that parses the file, and allowed it to modify it (i.e. Excel).


How about a virus scanner on any of the computers involved that is causing the .csv file to be opened by the native (Excel) application every time the file shows up in the file system?

 

I reviewed (more closely) the red/green example data you posted and the differences are EXACTLY what you would get if Excel (or similar) opened and saved the file. The only things that remained double-quoted are the data values that contain a comma separator character and the format of the date/time/missing leading zeros is something you could expect Excel to do for a date/time field.
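
To illustrate the quoting part of that: a writer that quotes a field only when RFC 4180 requires it (the field contains a comma, a double quote, or a line break) produces exactly the header line seen in the "uploaded" file. The sketch below is not Excel's actual code, just a minimal-quoting serializer with made-up function names, fed the header values from the sample above.

```php
<?php
/** Quote a field only when it contains a comma, a double quote, CR or LF. */
function quoteIfNeeded(string $field): string
{
    if (strpbrk($field, ",\"\r\n") === false) {
        return $field;                              // plain field: left bare
    }
    return '"' . str_replace('"', '""', $field) . '"';
}

/** Serialize one row with minimal quoting. */
function toMinimalCsvLine(array $fields): string
{
    return implode(',', array_map('quoteIfNeeded', $fields));
}

$header = ['#', 'Date Time, GMT-07:00', 'Temp, AC', 'Coupler Detached',
           'Coupler Attached', 'Stopped', 'End Of File'];

echo toMinimalCsvLine($header), "\n";
// #,"Date Time, GMT-07:00","Temp, AC",Coupler Detached,Coupler Attached,Stopped,End Of File
```

The logger quotes every text field; a program that parses the file and writes it back out with minimal quoting keeps the quotes only on the two fields that contain a comma.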


I want to let you know, even if you don't believe it, this file is being saved that way. Uploading a file does not change its content.

 

Open the file you are going to upload in Notepad++ and see what it actually contains.

