
Validating file uploads


NotionCommotion


When a file is uploaded, $_FILES will be populated with the name, type, and size (which are all provided by the browser, in the body and not the headers, right?) as well as tmp_name and error (which are presumably set by PHP).

If browser provided size is different than what filesize() reports, should I care or just go with filesize()?

What about the same question, but for the MIME type?

  • Some file types result in false positives such as the following and I will want to accept those as being valid, but should I reject them as being invalid if they are actually different?
  • Regarding detecting these multiple valid mime types, is there a PHP function to do so or any good composer/etc packages?
  • Also, I am thinking I should never bother saving the browser provided mime type because it is based on the individual browser and/or operating system the user happened to be using at the time, agree?
// Compare the browser-provided type with what fileinfo detects for one upload.
printf('extension: %s type: %s (provided) %s (fileinfo) FILEINFO_EXTENSION: %s<br>'.PHP_EOL,
    pathinfo($_FILES['expenseFile']['name'], PATHINFO_EXTENSION), // '' when the name has no extension
    $_FILES['expenseFile']['type'],
    (new \finfo(FILEINFO_MIME_TYPE))->file($_FILES['expenseFile']['tmp_name']),
    (new \finfo(FILEINFO_EXTENSION))->file($_FILES['expenseFile']['tmp_name'])
);
extension: csv type: application/vnd.ms-excel (provided) application/csv (fileinfo) FILEINFO_EXTENSION: ???
extension: gz type: application/x-gzip (provided) application/gzip (fileinfo) FILEINFO_EXTENSION: ???
extension: js type: text/javascript (provided) text/plain (fileinfo) FILEINFO_EXTENSION: ???
extension: css type: text/css (provided) text/plain (fileinfo) FILEINFO_EXTENSION: ???
extension: yaml type: application/octet-stream (provided) text/plain (fileinfo) FILEINFO_EXTENSION: ???
extension: ini type: application/octet-stream (provided) text/plain (fileinfo) FILEINFO_EXTENSION: ???

There is also the issue of requiring file extensions that match the actual file type; I wish to reject those that do not.  finfo's FILEINFO_EXTENSION constant provides answers for some types, but very few, at least with my version of the magic.mime database.  Any good approaches or 3rd party packages that can manage this?

extension: ods type: application/vnd.oasis.opendocument.spreadsheet (provided) application/vnd.oasis.opendocument.spreadsheet (fileinfo) FILEINFO_EXTENSION: ods
extension: png type: image/png (provided) image/png (fileinfo) FILEINFO_EXTENSION: png
extension: jpg type: image/jpeg (provided) image/jpeg (fileinfo) FILEINFO_EXTENSION: jpeg/jpg/jpe/jfif

Thanks!


4 hours ago, NotionCommotion said:

If browser provided size is different than what filesize() reports, should I care or just go with filesize()?

It can't be: the size is not just the size of the file but the amount of content that the browser sent to the server. If this did not match what the request actually had then there would have been problems.
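
To make that concrete: as far as I know, PHP fills in the size entry itself from the bytes it wrote to the temporary file, so it should always agree with filesize(). A minimal sketch of a sanity check, where uploadLooksIntact() is a hypothetical helper name and not something PHP provides:

```php
<?php

/**
 * Hypothetical helper: sanity-check one entry of $_FILES.
 * Returns true when PHP reported no upload error and the recorded size
 * matches the size of the temporary file on disk.
 */
function uploadLooksIntact(array $file): bool
{
    // 'error' is set by PHP; UPLOAD_ERR_OK means the upload completed.
    if (($file['error'] ?? UPLOAD_ERR_NO_FILE) !== UPLOAD_ERR_OK) {
        return false;
    }

    $tmp = $file['tmp_name'] ?? '';
    if ($tmp === '' || !is_file($tmp)) {
        return false;
    }

    // In a real request handler, also call is_uploaded_file($tmp) here.
    return filesize($tmp) === (int) ($file['size'] ?? -1);
}
```

In a request handler this would be called as uploadLooksIntact($_FILES['expenseFile']) before doing anything else with the file.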

 

4 hours ago, NotionCommotion said:

What about the same question, but for the MIME type?

That's the big one.

MIME type detection is naive and optimistic: it assumes that if the file has a few bytes in a certain location then the entire file is that one type. It won't be able to detect files with mixed content (think PHP code buried in the middle of some HTML) or files using containers (OpenDocument files are ZIP archives) or many types of text file formats. It can accurately detect audio and video data as well as "unique" binary formats.

That's where you have to enter with some specific knowledge to make decisions.

 

4 hours ago, NotionCommotion said:
  • Some file types result in false positives such as the following and I will want to accept those as being valid, but should I reject them as being invalid if they are actually different?

The detected types are correct, they're just not what you expected or wanted.

Windows in particular tends to identify files by extension, then equates those extensions with MIME types according to whatever software is installed. For example, having Office/Excel installed tells the system that .csv files are application/vnd.ms-excel because... well, because that's what it's been doing for a very long time. The point is that a Windows browser will happily report application/vnd.ms-excel because that's what it knows the file as. Identifying by extension is especially useful for text files. Linux too will frequently deem a file a certain type according to the extension and only use MIME detection as a fallback.

And I agree with that. It's a huge pain to try to deduce MIME type or the correct file extension just from the contents. So don't do that. Instead, in the general case, validate that the MIME type you detect is consistent with the extension - and optionally with the reported MIME type.
(That's the general case. For more specific cases, like you only want to support images, sometimes it can be done reliably with only MIME types.)
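
A rough sketch of what that consistency check could look like. The ACCEPTED_TYPES table, isTypeConsistent() and detectType() are names made up for this illustration; the table entries are taken from the sample output earlier in the thread plus a couple of registered types (text/csv, application/javascript) that I am assuming you would also want to accept.

```php
<?php

// Illustrative allow-list: extension => detected MIME types to accept.
// Mostly based on the sample output in this thread; extend it per application.
const ACCEPTED_TYPES = [
    'csv'  => ['text/csv', 'application/csv', 'text/plain'],
    'gz'   => ['application/gzip', 'application/x-gzip'],
    'js'   => ['text/javascript', 'application/javascript', 'text/plain'],
    'css'  => ['text/css', 'text/plain'],
    'png'  => ['image/png'],
    'jpg'  => ['image/jpeg'],
    'jpeg' => ['image/jpeg'],
];

/** True when the detected type is one of those accepted for the file's extension. */
function isTypeConsistent(string $filename, string $detectedType, array $map = ACCEPTED_TYPES): bool
{
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // Unknown or missing extensions are rejected outright.
    return isset($map[$ext]) && in_array(strtolower($detectedType), $map[$ext], true);
}

/** Detect the MIME type from file contents using the fileinfo extension. */
function detectType(string $path): string
{
    $type = (new finfo(FILEINFO_MIME_TYPE))->file($path);

    return $type === false ? 'application/octet-stream' : $type;
}
```

Usage would be something like isTypeConsistent($_FILES['expenseFile']['name'], detectType($_FILES['expenseFile']['tmp_name'])). The same function can be called a second time with $_FILES['expenseFile']['type'] if you also want the reported type to be consistent.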

And above all else, if you want to store arbitrary files, install a virus scanner or two.

 

4 hours ago, NotionCommotion said:
  • Also, I am thinking I should never bother saving the browser provided mime type because it is based on the individual browser and/or operating system the user happened to be using at the time, agree?

Mostly disagree. While you should assume the client is malicious, in the real world that's very often not the case, and throwing away data because it might be incorrect is hurting yourself.

 

4 hours ago, NotionCommotion said:

There is also the issue of requiring file extensions that match the actual file type; I wish to reject those that do not.  finfo's FILEINFO_EXTENSION constant provides answers for some types, but very few, at least with my version of the magic.mime database.  Any good approaches or 3rd party packages that can manage this?

extension: ods type: application/vnd.oasis.opendocument.spreadsheet (provided) application/vnd.oasis.opendocument.spreadsheet (fileinfo) FILEINFO_EXTENSION: ods
extension: png type: image/png (provided) image/png (fileinfo) FILEINFO_EXTENSION: png
extension: jpg type: image/jpeg (provided) image/jpeg (fileinfo) FILEINFO_EXTENSION: jpeg/jpg/jpe/jfif

But how do you know it does not match? It's easy to pick examples like images, but what about HTML with some PHP code buried in the middle? You'll receive a .php extension but detection will say it's .htm/html.
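
For that particular case, one crude option is to scan the contents yourself rather than rely on MIME detection. This is only a sketch under my own assumptions: containsPhpTag() is a hypothetical helper, it only looks for the "<?php" and "<?=" opening tags, and a binary file can contain those bytes by accident, so treat a hit as a reason to look closer rather than as proof.

```php
<?php

/**
 * Hypothetical helper: report whether a file contains a PHP opening tag
 * anywhere in its contents. Reads in chunks and keeps a small overlap so a
 * tag split across two chunks is still found.
 */
function containsPhpTag(string $path, int $chunkSize = 8192): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $tail = '';
    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                break;
            }
            $window = $tail . $chunk;
            if (stripos($window, '<?php') !== false || strpos($window, '<?=') !== false) {
                return true;
            }
            // The longest tag is 5 bytes, so carrying over 4 bytes is enough.
            $tail = (string) substr($window, -4);
        }
    } finally {
        fclose($handle);
    }

    return false;
}
```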


2 hours ago, requinix said:

It can't be: the size is not just the size of the file but the amount of content that the browser sent to the server. If this did not match what the request actually had then there would have been problems.

There is both the Content-Length in the request header and the size value in $_FILES.  Aren't they two separate things?

 

2 hours ago, requinix said:

That's the big one.

MIME type detection is naive and optimistic: it assumes that if the file has a few bytes in a certain location then the entire file is that one type. It won't be able to detect files with mixed content (think PHP code buried in the middle of some HTML) or files using containers (OpenDocument files are ZIP archives) or many types of text file formats. It can accurately detect audio and video data as well as "unique" binary formats.

The detected types are correct, they're just not what you expected or wanted.

And I agree with that. It's a huge pain to try to deduce MIME type or the correct file extension just from the contents. So don't do that. Instead, in the general case, validate that the MIME type you detect is consistent with the extension - and optionally with the reported MIME type.
(That's the general case. For more specific cases, like you only want to support images, sometimes it can be done reliably with only MIME types.)

But how do you know it does not match? It's easy to pick examples like images, but what about HTML with some PHP code buried in the middle? You'll receive a .php extension but detection will say it's .htm/html.

My purpose is to allow a user (organization) to limit the types of files outside users can upload based on the software the user/organization has.  Almost everyone has software for PDFs, various images, various Microsoft documents, etc., but there are also file types such as AutoCAD, various BIM formats, and others.  ZIP archives will need to be supported to allow OpenDocument files, and they add some complexity as they contain other files, but I suppose they can be opened and inspected prior to saving.
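
A sketch of what "opened and inspected" might look like for OpenDocument files, assuming the zip extension is available. OpenDocument packages carry a "mimetype" entry holding the declared media type, which can be read with ZipArchive. The two function names below are hypothetical.

```php
<?php

/**
 * Hypothetical helper: return the media type an OpenDocument package declares
 * in its "mimetype" entry, or null if the file cannot be opened as a ZIP
 * archive or has no such entry. Requires the zip extension.
 */
function declaredOpenDocumentType(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $type = $zip->getFromName('mimetype');
    $zip->close();

    return $type === false ? null : trim($type);
}

/** Hypothetical helper: list the entry names inside a ZIP archive. */
function listArchiveEntries(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();

    return $names;
}
```

The declared type is still just a claim made by whoever built the archive, so it would be compared against the extension and the allowed list like any other upload. The entry list can be used to reject archives that contain file types the organization does not accept.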

Regarding validating that the detected MIME type is consistent with the extension, it seems like this is a common need, so I would expect some de facto standard open-source package to exist, but I haven't found one.

Will have to give this one more thought...

 

2 hours ago, requinix said:

Mostly disagree. While you should assume the client is malicious, in the real world that's very often not the case, and throwing away data because it might be incorrect is hurting yourself.

Guess I can store it but don't know what to do with it.  When later providing the file for download, would I want to use this value or the detected value?  What if two identical files were uploaded but with different clients and were given different MIME types?  Would I return them with different MIME types?


24 minutes ago, NotionCommotion said:

There is both the Content-Length in the request header and the size value in $_FILES.  Aren't they two separate things?

The Content-Length in the request header (if there even is one) does not describe the file. It describes the entire request.

Take a look at how multipart/form-data requests are structured and that might help explain what's going on.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST
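
As an illustration of that structure, here is a sketch that builds a single-file multipart/form-data body by hand. The boundary, field name, and file name are made-up values. Note that the part headers carry the file name and the Content-Type the browser chose but no size, and that the length of the whole body (what Content-Length would describe) is larger than the file itself.

```php
<?php

/**
 * Build the body of a multipart/form-data request with a single file part,
 * roughly as a browser would. For illustration only.
 */
function buildMultipartBody(
    string $boundary,
    string $field,
    string $filename,
    string $type,
    string $content
): string {
    return "--$boundary\r\n"
        . "Content-Disposition: form-data; name=\"$field\"; filename=\"$filename\"\r\n"
        . "Content-Type: $type\r\n"
        . "\r\n"
        . $content . "\r\n"
        . "--$boundary--\r\n";
}

$content = "a,b\n1,2\n";
$body = buildMultipartBody('XyZ', 'expenseFile', 'report.csv', 'application/vnd.ms-excel', $content);

// The request's Content-Length covers the whole body, not just the file bytes.
printf("file bytes: %d, body bytes: %d\n", strlen($content), strlen($body));
```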

 

24 minutes ago, NotionCommotion said:

Regarding validating that the detected MIME type is consistent with the extension, it seems like this is a common need, so I would expect some de facto standard open-source package to exist, but I haven't found one.

Could very well be. But these things are also frequently dependent upon the application itself.

Maybe what you need is not so much a library but a curated database you can read.

 

24 minutes ago, NotionCommotion said:

Guess I can store it but don't know what to do with it.  When later providing the file for download, would I want to use this value or the detected value?

Assuming you validated that the provided type was correct, because if not then you shouldn't be storing it at all, then you would use it instead of whatever type you tried to guess it was.
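
Serving the file back with the stored type might look roughly like the sketch below. downloadHeaders() is a hypothetical helper. The nosniff header and the forced attachment disposition are my own additions, on the assumption that you don't want browsers second-guessing the type or rendering uploads inline.

```php
<?php

/**
 * Hypothetical helper: build response headers for serving a stored upload.
 * $storedType is the MIME type that was validated and saved at upload time.
 */
function downloadHeaders(string $storedType, string $filename, int $size): array
{
    // Drop any path and strip characters that could break the quoted filename.
    $safeName = str_replace(['"', "\r", "\n", '\\'], '', basename($filename));

    return [
        'Content-Type: ' . $storedType,
        'Content-Length: ' . $size,
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        'X-Content-Type-Options: nosniff',
    ];
}
```

In the download script you would loop over the result calling header() on each line and then readfile() the stored file.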

 

32 minutes ago, NotionCommotion said:

What if two identical files were uploaded but with different clients and were given different MIME types?  Would I return them with different MIME types?

Sure. Why would it matter if they were different?


On 12/16/2021 at 5:49 PM, requinix said:

Maybe what you need is not so much a library but a curated database you can read.

Yes, I think so.  Any suggestions on where to find one?

On 12/16/2021 at 5:49 PM, requinix said:

Assuming you validated that the provided type was correct, because if not then you shouldn't be storing it at all, then you would use it instead of whatever type you tried to guess it was.

Sure. Why would it matter if they were different?

Thank you, I was originally thinking differently, but now fully agree.


  • 2 weeks later...
On 12/16/2021 at 5:49 PM, requinix said:

Maybe what you need is not so much a library but a curated database you can read.

On 12/19/2021 at 6:49 AM, NotionCommotion said:

Yes, I think so.  Any suggestions on where to find one?

On 12/20/2021 at 4:46 PM, requinix said:

No clue.

While not curated, I suppose I could build my own using https://www.iana.org/assignments/media-types/media-types.xhtml as a reference, or perhaps https://github.com/jshttp/mime-db would be a better starting point.  It seems, however, that this is a fairly common need in PHP applications, so I would expect some Composer package to exist, which would be easier to maintain.
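
If I go the mime-db route, a first pass could be to invert its data into an extension lookup. This sketch assumes the db.json layout is a map of MIME type to an object with an optional "extensions" list, and buildExtensionMap() is just a name I picked. Aliases that browsers or fileinfo report but the database doesn't list would probably still need to be added by hand.

```php
<?php

/**
 * Invert a mime-db style array (type => ['extensions' => [...]]) into
 * extension => list of MIME types. Types without extensions are skipped.
 */
function buildExtensionMap(array $db): array
{
    $map = [];
    foreach ($db as $type => $info) {
        foreach ($info['extensions'] ?? [] as $ext) {
            $map[strtolower($ext)][] = strtolower((string) $type);
        }
    }
    ksort($map);

    return $map;
}

// Assumed usage, with db.json downloaded from the mime-db repository beforehand:
// $db  = json_decode(file_get_contents('db.json'), true);
// $map = buildExtensionMap($db);
```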

 

