Storing Images

Adamhumbug · February 19, 2024

Hi All,

I have an application with users, and each user needs a photo stored on their account. In the past I have simply saved the image in a folder on the server, but I am wondering what the best, most efficient, most secure, and most space-saving method would be for handling user pictures.

Any suggestions here would be welcome.

Thanks

requinix · February 20, 2024

For simple applications, storing the file on a server is typically the easiest approach: file gets uploaded, you stick it somewhere that's accessible by the web server, and you can use a direct URL straight to the image.

Breaks down when you have multiple servers. Doesn't support cases where you don't want just anybody on the internet to be able to see the image (though in this case you probably do want that). And backing up those images can be a pain in the ass.

The alternative is storing the image in your database. Has overhead, like having to know the image file type and needing to provide some sort of script that can render the image, but the benefits may be worth it.
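To make that overhead concrete, here is a minimal sketch of the database approach. It uses Python's built-in sqlite3 module purely for illustration; the `user_photos` table, its columns, and the function names are assumptions rather than anything from a real schema. The point is that the MIME type has to be stored next to the blob so the rendering script knows which Content-Type to send.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a database with a hypothetical user_photos table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_photos ("
        " user_id INTEGER PRIMARY KEY,"
        " mime_type TEXT NOT NULL,"
        " data BLOB NOT NULL)"
    )
    return conn

def save_photo(conn, user_id, mime_type, data):
    """Insert or replace the photo for one user."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO user_photos (user_id, mime_type, data)"
            " VALUES (?, ?, ?)",
            (user_id, mime_type, data),
        )

def load_photo(conn, user_id):
    """Return (mime_type, bytes) for a user, or None if there is no photo."""
    row = conn.execute(
        "SELECT mime_type, data FROM user_photos WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return None if row is None else (row[0], bytes(row[1]))
```

The script that renders the image would call something like `load_photo`, send the stored MIME type as the Content-Type header, and write the bytes as the response body.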

Adamhumbug · February 20, 2024

Thanks for this - there are likely going to be tens of thousands of users. Does this make the database approach more favourable? I actually don't want the general internet to be able to see the images; they are used for ID purposes.

requinix · February 20, 2024

6 hours ago, Adamhumbug said:

Thanks for this - there are likely going to be tens of thousands of users.

lol. Databases are built to handle far more than that. Even with image blobs.

6 hours ago, Adamhumbug said:

Does this make the database approach more favourable? I actually don't want the general internet to be able to see the images; they are used for ID purposes.

If you don't want them visible then you could still store them as files - just outside the web root. But it does mean that particular advantage (a direct URL straight to the image) mostly doesn't matter to you.

gizmola · February 20, 2024

Scalability is also a concern. Here is some food for thought.

What matters for scalability is the maximum number of concurrent users, i.e. how many requests you have to serve at the same moment. You need tools to help you simulate this type of load before you can really get an understanding of any potential bottlenecks, or of what concurrent load your system can handle and still operate.

Assumption #1: you have a monolithic server.

What this means is that your application will run in a stack where everything (other than possibly the database) will run on the same server.

- Sessions will use the filesystem
- Images will be stored on the filesystem
- Database can be running on the same server or not
- Reverse proxy will run on the same server (assuming you are employing one)

This sort of setup is typical, and has limited scalability, and suffers from contention issues when load increases. If this is how your production will run, you at least want to learn a bit about how your setup performs as load increases. Everything takes memory, and databases don't work well if they run out of memory or resources. A frequent mistake people make in setting up a database is to provide inadequate memory allocation. Databases are pretty much always given their own dedicated machine to run, for anything that isn't a hobby or non-commercial endeavor.

Advantages to storing data on the filesystem:

- Filesystem already buffers data and is highly efficient
- The cost of returning data is only IO + bandwidth
  1. Stored in a database, a read of a blob requires IO + network delivery to application + network delivery to client
  2. Stored in the db, blob storage and retrieval is often non-optimal
  3. Stored in the db, blobs balloon the size of the database dataset, making database caching less effective, and can also slow down queries that touch the table(s) with the blobs in them.

In terms of security, you will need to store the files in a location that is outside web space (i.e. not under the webroot). It is easy enough to write the routine you need that returns the data from a location on the filesystem.
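A routine like that could look roughly like the sketch below. The directory path, the one-file-per-user-ID naming, the `.jpg` extension, and the `viewer_is_authorised` flag are all assumptions for illustration; the real permission check would come from your application's login and authorisation logic.

```python
from pathlib import Path

# Assumed layout: photos live outside the web root, one file per user ID.
PHOTO_DIR = Path("/var/app-data/user-photos")  # hypothetical path

def photo_path(user_id, base_dir=PHOTO_DIR):
    """Map a numeric user ID to a file path inside base_dir."""
    # Accepting only integers means no user-supplied string ever reaches the
    # filesystem, which rules out "../" style path traversal.
    if isinstance(user_id, bool) or not isinstance(user_id, int) or user_id < 0:
        raise ValueError("user_id must be a non-negative integer")
    return Path(base_dir) / f"{user_id}.jpg"

def read_photo(user_id, viewer_is_authorised, base_dir=PHOTO_DIR):
    """Return the photo bytes, or None if access is denied or no file exists."""
    if not viewer_is_authorised:
        return None
    path = photo_path(user_id, base_dir)
    if not path.is_file():
        return None
    return path.read_bytes()
```

Because the web server cannot reach the directory directly, the only way to get an image is through this routine, so the access check cannot be bypassed with a guessed URL.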

Storing in a database does have this advantage, in terms of the potential for scalability:

- Making the application scalable is simpler, as you can have 1-n application servers connecting to the same database (and this is the 1st level typical of a move to a scalable architecture from what started as a monolithic app)
- This is another reason to start with a reverse proxy even with a monolithic architecture.
  1. Once you have app server #2 you have broken sessions
    1. You can move sessions into a database or distributed cache like memcached or redis
      1. Moving sessions into a DB can add a lot of load to the db
    2. You can use the reverse proxy to pin sessions (ie. "sticky sessions") to a specific app server. Usually this is done using a cookie that the reverse proxy adds and subsequently uses to keep traffic coming back to the same app server once an initial connection is made.
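To illustrate the sticky-session idea, here is a toy model of the decision a reverse proxy makes for each request. It is not how any particular proxy implements it; the class, the `backend_id` cookie name, and the round-robin choice for new visitors are assumptions made for the sketch.

```python
import itertools

class StickyBalancer:
    """Toy model of cookie-based session pinning at a reverse proxy."""

    COOKIE = "backend_id"  # hypothetical cookie name

    def __init__(self, backends):
        self.backends = list(backends)
        # New visitors are spread across the backends in round-robin order.
        self._next = itertools.cycle(range(len(self.backends)))

    def route(self, cookies):
        """Return (backend, cookies_to_set) for one incoming request."""
        pinned = cookies.get(self.COOKIE)
        if (
            pinned is not None
            and pinned.isascii()
            and pinned.isdigit()
            and int(pinned) < len(self.backends)
        ):
            # Returning visitor: keep them on the server holding their session.
            return self.backends[int(pinned)], {}
        # First request (or an invalid cookie): pick a server and pin it.
        index = next(self._next)
        return self.backends[index], {self.COOKIE: str(index)}
```

For example, with `StickyBalancer(["app1", "app2"])`, a first request with no cookies is sent to one server and gets the cookie set; any later request carrying that cookie goes back to the same server, so filesystem sessions on that server keep working.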

The other ways to scale an application that uses images stored on a filesystem:

- Use an NFS server or NAS appliance
  - I've worked for a number of companies with large amounts of data and files. In some cases, the problem could be solved with a NAS device, which servers can then mount using NFS as a client.
- Use a file storage service
  - A good example of this is AWS S3. Some ISPs have their own alternative, and there are even consumer-grade services like Dropbox you can make use of if you look into it. Whether or not this is smart or feasible again comes down to the application infrastructure, but as a rule of thumb, you are more likely to have the app experience feel similar if the object storage is local/within your hosting infrastructure. For example, if you had a server hosted by Linode, they offer an S3-compatible alternative service, and I'd look into that.
    - The downside here is additional costs for the storage of the assets and possibly egress costs to retrieve. There are a lot of different "Object Storage" companies out there, and they are intrinsically scalable, so it wouldn't hurt to do some research.
      - Take this list with a grain of salt, but here's a way to start looking at the possible vendors: https://www.g2.com/categories/object-storage-solutions

Adamhumbug · February 21, 2024

This is a really great answer and I appreciate it. Again, a lot of this is very new to me and there will be a lot of research required. I think the S3 option could be a good one for me as I plan to host on AWS.

You have also raised a lot of very good points that I am yet to look into.

At what point in the application build would you suggest starting to test load? I am pretty early in the process at the minute, but as I have never load tested an application before, I wonder when the best time to start would be.

gizmola · February 24, 2024

I wouldn't worry about load testing until you have an MVP. Unit tests are much more important in the development phase.

With that said, a fast and simple way of running a load test is to use the ApacheBench (ab) program, which ships with the Apache HTTP Server. It's a simple CLI program that you can use to send a bunch of requests using multiple socket connections. You can also do some minimal authentication and POST requests with it.

Beyond ab, there are a lot of other tools, like Siege and JMeter, that have different strengths and use cases.
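If it helps to see what such a tool is doing under the hood, below is a rough sketch of the same idea in Python: send a fixed number of GET requests through a pool of concurrent workers and summarise the results, which is roughly what `ab -n 100 -c 10 <url>` does. The function names and the shape of the returned dictionary are made up for this sketch, and a real tool like ab will be far more accurate and featureful.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url, timeout=10):
    """Request url once; return (succeeded, seconds_taken)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; count them as failures.
        ok = False
    return ok, time.perf_counter() - start

def load_test(url, total=100, concurrency=10, fetch_one=fetch):
    """Send `total` requests using `concurrency` worker threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch_one, [url] * total))
    elapsed = time.perf_counter() - started
    times = sorted(seconds for _, seconds in results)
    return {
        "requests": total,
        "failed": sum(1 for ok, _ in results if not ok),
        "requests_per_second": total / elapsed,
        "median_seconds": times[len(times) // 2],
    }
```

Calling `load_test("http://localhost:8080/")` against a development server (the URL here is just a placeholder) gives a quick feel for how throughput and failures change as you raise `concurrency`. Only point it at servers you own.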

Adamhumbug · February 27, 2024

Thanks for this - I will look into what you have suggested when I get further through the project.
