How to store a list?

NotionCommotion · May 15, 2016

Yes, I know being tricky and storing multiple data in a single column is usually always a bad idea. But is there ever a time for it?

I have an email application that the user enters their default CC list under their user profile, and then on a per email sent bases, their actual CC list.

For instance, their default email CC list might be: mary@example.com; bob@example.com; linda@example.com;

But when they compose their email, they change the email CC list to: beth@example.com; bob@example.com;

Would it ever be a good idea to store the data lists as semi-column separated values, or should it always be moved to another table?

Adam · May 15, 2016

By "moved to another table" I presume you just mean normalised? It really depends on what kind of functionality you're going to provide. If it's just a basic send mail wrapper then you probably don't need to worry, given it will be sent in the CC header in exactly that format.

If however you want to provide a more advanced address book, and/or provide some kind of filtering of sent mails, then I would probably normalise the data. Normalising the data complicates the code of course, so unless you have the need to do it I wouldn't bother -- KISS.

kicken · May 15, 2016

A basic rule of thumb is if you ever need to do something such as search for a particular value within the list, then you normalize it into it's own table with one value per row.

If you only ever work with the data as a whole, then usually storing it as a single value will work ok. I tend to prefer storing a JSON structure rater than a simple list, but either way would work fine.

NotionCommotion · May 24, 2016

Thanks Adam and Kicken,

If you ask a SQL only person, they will say always normalize to at least 3 levels, and if you have performance issues, consider doing less, however, I tend to agree with both of you. Guess I really have to understand what future use might be. If I later wanted to find all emails which "bob" was in the CC list, storing JSON wouldn't be good, however, I don't expect I will ever need to do so, and if it later becomes a need, could always restructure the data.

Are there rules of thumb how many pieces of data could/should be stored in JSON? For instance, for another application, I wish to store several current environmental monitored data for a bunch of buildings, the number and type of environmental data for each building will change and is not the same for every building, and the data will be save and retrieve on a per-building basis. Doing it this way makes it very nice and I would expect efficient if I never wanted to get a list of all room temperatures, but only wanted to get all current data from Building #2. But what if Building #2 had 1,000 data points? Probably not a problem, right?

building_id   data
1             {"Room 321 temperature":72, "Room 222 pressure": 0.02}
2             {"Room 111 humidity": 0.43, "Room 111 pressure": 0.01, "Room 333 temperature": 68}
....

kicken · May 24, 2016

If you can fit it into a column then it'll be fine. I wouldn't worry about the length of the data, just how it's used.

In the situation you describe, I would normalize it out into separate tables, not because a building may have 1000 data points, but because it seems likely to me that you would want to (if not now, in the future) do work with the individual data points.

For example, what if you decided to keep a history of data so you could track changes over time? With just single json column, it would constantly grow in size as data was captured, likely exceeding the limit of the column's data type eventually. Trying to find data for a particular time (ie, most recent data) would also be overly complicated.

As you also mentioned, what if you want to see just the current temperature of each building? Not easily done with just SQL, and even with PHP you'd waste a lot of resources passing back huge json strings just to extract a single number.

When I store simple json (or whatever) strings it's usually more of configuration type thing. For example in one application with a customizable menu system I stored individual menu item configurations as a simple json string. Each menu item could be of various types, each of with required different parameters. There was never any need to query these parameters individually. If such a requirement had existed, then having another key/value table for the configuration would have been appropriate.

ginerjm · May 24, 2016

The whole idea of an RDBMS is to make it easy to access data quickly and efficiently. To put like data elements into the same table column means that you will ALWAYS have to finagle it to access just one of those data items which conflicts with the reason you are using the RDBMS in the first place.

Edited May 24, 2016 by ginerjm

Jacques1 · May 24, 2016

I agree with ginerjm.

Using JSON or comma-separated lists in SQL is a hack, especially when the database system has no native support for this data type (like MySQL). Sometimes hacks are valid, but they should be the last resort. If you need JSON a lot, that indicates you either a) aren't using SQL correctly or b) don't really want SQL (there are many other models, e. g. document stores like MongoDB).

This isn't just a theoretical question. Going back to your initial example: If the e-mail addresses are properly stored as rows in a table, adding and deleting addresses is trivial. If you try the same thing with a comma-separated list, you suddenly have to worry about race conditions. If two processes load the list, one of them deletes an entry, the other adds an entry, then one of the two updates will be lost (depending on which process comes last). Note that database systems with native list support (like PostgreSQL) don't have this particular problem.

Edited May 24, 2016 by Jacques1

NotionCommotion · May 24, 2016

Thanks Kicken, Ginerjm, and Jacques1,

For the temperature/humidity application, the data is used two ways:

Quickly return last monitored values in a building building.
Return aggregate values and trend values.

My plan was to use a time series database such as InfluxDB for the aggregate values and trends. I've also considered changing to PostgreSQL for this as I understand it is more applicable than MySQL, but "think" InfluxDB (or similar) is more applicable.

Getting the most current values will occur more often, and while I could just use the one database and return the last value, I thought maybe it would be much quicker also storing these values in a single column as JSON. The application would never try to add/delete/modify individual data in the JSON, but would just re-write the entire string every time a group of data is received for a given building, and a race condition will not occur. Albeit, a given request would never need all the most current values in a given building, but might need 40 out of the total 1,000 values.

Don't know if this makes any more sense...

ginerjm · May 24, 2016

Thirty years I've been in this - never heard of such a scheme.

IMHO - you'll live to regret this.

NotionCommotion · May 25, 2016

Hi ginerjm, I expect you are right about regretting doing so, but don't know why. Mind sharing your reasons? Thanks

Jacques1 · May 25, 2016

The problem is that you're trying to implement a non-relational data concept with a relational database. This is like choosing an object-oriented programming language and then writing purely procedural code all crammed into a single method. Sure, that's technically possible. But it means you're working against the language/database rather than with it. You get none of its strengths but all of its downsides.

SQL expects you to store your data in tables. That's the precondition for all SQL features (data integrity, performance, the ACID properties etc.). If you violate this contract and start cramming your data into a couple of TEXT fields, the database system cannot really promise anything. There's no data integrity, the performance may be abysmal, and concurrent updates can go horribly wrong depending on the concrete scenario. Even worse, you now have a weird mixture of a database language you don't really need (SQL) and a lot of application code to re-implement data processing features for your own model.

If the tool doesn't match your concept, one of them is wrong. Personally, I'd go back to the original problem and ask a simple question: What's the best way to store the data? If the answer is “as JSON documents”, forget about SQL and use a document store. If the answer is “as relations”, forget about JSON and use an SQL database.

ginerjm · May 25, 2016

I agree with Jacques1. I also think I gave you the reason for my/our position in my previous post. It goes against all well-defined principles of data storage to do what you are doing.

Do you really think you are qualified to re-invent the (database) wheel?

Answer me/us this. Suppose you need to count one of the many unique data values that you have embedded in multiple records in your database, ie, for some reason you want to analyze the values you have. In a properly normalized database it is simply a quick and dirty sql query. For you?

This idea of properly normalized databases is 1) not new. It's been around since my college days; and 2) nobody else is coming up with a "new and improved" method for data storage, at least not in the logical sense. What makes you think you have to do it?

Edited May 25, 2016 by ginerjm

Sign In

How to store a list?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Important Information