Table design: Sum+grouping values by date and time of day. (500k+ rows)

Christian F. · February 21, 2013

A little background on the project I'm working on:

I have a reasonably large number of values, over 500k rows' worth. The values are generated per hour, per parent. I want to group these values by one of four time intervals (month, week, day or hour) and get the average for each period.

The challenge comes from the fact that I need to split the values into two different blocks, based upon whether the value was generated during the "day" or "night/weekend".

So far I've come up with two different approaches to this:

1. Have two fields in the table for saving the values: one for the "daytime" values and the other for the "night time/weekend" values.
   This would allow me to run separate functions on the fields without involving the date functions in MySQL, at the cost of having two fields for what is essentially the same type of data, and having to do the splitting in the pre-insert phase.
2. Save all values in the same field, and then use MySQL's datetime functions with a CASE-WHEN to figure out which block they belong in.
   This approach seems to be the cleanest one in terms of database design, but I fear it will put a lot of strain on the database, especially since the SELECT statements will run more often than the insertion script.

So the question is if I should go for the first approach, the second, or if there is some other solution to this that I've failed to grasp? I could really do with some expert help on this one.

Example values:

timestamp              parent_1     parent_2    parent_3
2013.02.04 04:00:00    6,300000     1,400000    4,000000
2013.02.04 05:00:00    6,300000     1,400000    4,000000
2013.02.04 06:00:00    6,300000     1,400000    4,000000
2013.02.04 07:00:00    6,300000     1,400000    4,000000
2013.02.04 08:00:00    6,300000     1,400000    4,000000
2013.02.04 09:00:00    6,300000     1,400000    4,000000
2013.02.04 10:00:00    13,600000    9,900000    10,800000
2013.02.04 11:00:00    13,600000    9,900000    10,800000

Desired output:

Date          Day avg.    Night avg.
-- Parent_1
2013.02.04    99,5        6,3
.....

-- Parent_2
2013.02.04    5,65        1,4
.....

-- Parent_3
2013.02.04    4,9         4,0
.....
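
For comparison, the second approach (one value column, classified at query time with CASE-WHEN) could be sketched roughly as below. This is only an illustration under assumptions the post does not state: SQLite (through Python's sqlite3 module) stands in for MySQL so the sketch runs on its own, the table and column names are made up, and "day" is assumed to mean Monday to Friday, 07:00 to 17:59. In MySQL the same CASE would be written with HOUR() and WEEKDAY() rather than strftime().

```python
import sqlite3

# SQLite stands in for MySQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (parent_id INTEGER, ts TEXT, value REAL)")

# parent_1 rows from the example values (2013-02-04 is a Monday).
rows = [(1, f"2013-02-04 {hour:02d}:00:00", 6.3 if hour < 10 else 13.6)
        for hour in range(4, 12)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# ASSUMPTION: "day" = Monday-Friday, 07:00-17:59; the rest is night/weekend.
# strftime('%w') gives 0 (Sunday) .. 6 (Saturday); '%H' gives the hour 00..23.
is_day = ("(CAST(strftime('%w', ts) AS INTEGER) BETWEEN 1 AND 5"
          " AND CAST(strftime('%H', ts) AS INTEGER) BETWEEN 7 AND 17)")

# AVG() skips NULLs, so each CASE only feeds its own block into the average.
sql = f"""
SELECT parent_id,
       date(ts) AS day,
       AVG(CASE WHEN {is_day} THEN value END)     AS day_avg,
       AVG(CASE WHEN NOT {is_day} THEN value END) AS night_avg
FROM data
GROUP BY parent_id, date(ts)
"""
result = conn.execute(sql).fetchall()
for parent_id, day, day_avg, night_avg in result:
    print(parent_id, day, round(day_avg, 2), round(night_avg, 2))
```

The weekday and hour extraction runs for every row on every SELECT, which is exactly the cost the question is worried about.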

Edited February 21, 2013 by Christian F.

Barand · February 21, 2013

http://forums.phpfreaks.com/topic/273634-best-way-to-set-up-tables-when-multiple-values/?do=findComment&comment=1408360

Christian F. · February 21, 2013

Thank you, Barand, it was just what I needed (I think).

I've come to the conclusion that there is indeed a third option, which I missed: creating a new table, called "data_timeslot", and referencing it in the data table. I can then group the data by this field to separate it properly.

The benefits of this, as I see it, are that the integrity of the data is not dependent upon the table structure, I don't have to use datetime calculations to sort each individual row, and it's possible to change/add timeslots without altering the structure of the tables themselves.

Thank you again, Barand.

Oh, and if I'm still doing it wrong, please feel free to shout at me.

Table definitions:

CREATE TABLE `data_timeslot` (
 `id`   TINYINT UNSIGNED AUTO_INCREMENT,
 `name` VARCHAR(30)      NOT NULL,
 PRIMARY KEY(`id`)
) ENGINE=InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `data` (
 `id`          BIGINT UNSIGNED  AUTO_INCREMENT,
 `parent_id`   INT UNSIGNED     NOT NULL,
 `timeslot_id` TINYINT UNSIGNED NOT NULL,
 `timestamp`   TIMESTAMP        NOT NULL,
 `value`       DECIMAL(9,6)     NOT NULL,
 PRIMARY KEY(`id`),
 FOREIGN KEY(`parent_id`) REFERENCES `parent`(`id`)
   ON UPDATE CASCADE
   ON DELETE RESTRICT,
 FOREIGN KEY(`timeslot_id`) REFERENCES `data_timeslot`(`id`)
   ON UPDATE CASCADE
   ON DELETE RESTRICT
) ENGINE=InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;

Yes, I know you don't like backticks in the SQL code.
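
To illustrate how the timeslot column replaces the per-row datetime calculation, here is a rough sketch in which the slot is decided once, before the insert, and the SELECT only groups on it. Several things here are assumptions rather than anything stated in the thread: SQLite (through Python's sqlite3 module) stands in for MySQL, the slot names "day" and "night" and their ids are made up, "day" is taken to mean Monday to Friday, 07:00 to 17:59, timeslot_for() is a hypothetical helper, and the parent table and foreign keys are left out for brevity.

```python
import sqlite3
from datetime import datetime

DAY, NIGHT = 1, 2  # hypothetical ids for the two rows in data_timeslot

def timeslot_for(ts: datetime) -> int:
    """Pick the slot before inserting. ASSUMPTION: day = Mon-Fri, 07:00-17:59."""
    is_weekday = ts.weekday() < 5  # Monday is 0, Sunday is 6
    return DAY if is_weekday and 7 <= ts.hour < 18 else NIGHT

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL
conn.executescript("""
CREATE TABLE data_timeslot (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE data (parent_id INTEGER, timeslot_id INTEGER, ts TEXT, value REAL);
INSERT INTO data_timeslot VALUES (1, 'day'), (2, 'night');
""")

# parent_1 rows from the example values earlier in the thread.
for hour in range(4, 12):
    ts = datetime(2013, 2, 4, hour)
    conn.execute("INSERT INTO data VALUES (?, ?, ?, ?)",
                 (1, timeslot_for(ts), ts.strftime("%Y-%m-%d %H:%M:%S"),
                  6.3 if hour < 10 else 13.6))

# No CASE or weekday maths at query time: the slot is just another GROUP BY column.
result = conn.execute("""
SELECT d.parent_id, date(d.ts) AS date_res, t.name, AVG(d.value)
FROM data AS d
JOIN data_timeslot AS t ON t.id = d.timeslot_id
GROUP BY d.parent_id, date(d.ts), d.timeslot_id
ORDER BY d.parent_id, date_res, t.name
""").fetchall()
for row in result:
    print(row)
```

Changing what counts as "day" later only means touching timeslot_for() and re-tagging existing rows, not altering the table structure.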

Edited February 21, 2013 by Christian F.

Barand · February 21, 2013

Much better, I could live with that

BTW, is the "spoiler" a standard feature on this board? If so, how do I use it?

Jessica · February 21, 2013

Use [ spoiler] just like [ code]

Christian F. · February 21, 2013

Hehe, good to know.

The spoiler tags are indeed a standard feature of the board, easily added by using [ spoiler] and [ /spoiler] around the content you want to spoil.

jazzman1 · February 22, 2013

Christian, did you get an error message when you tried to create the second table, data?

I got an error, and I think you have a problem creating the foreign key.

jazzman1 · February 22, 2013

Oops... you have one more table, named parent.

Sorry about that

Christian F. · February 22, 2013

It's not so much a problem with creating the foreign key as the entire "parent" table being missing from the code I posted above. I didn't include it as it was trivial, and not relevant to the problem at hand. It should be easy to recreate.

While we're (still) on this subject, I was hoping someone could tell me if there's a way to make this query use indices. Right now, no matter what indices I add, and in what combination, it still insists upon using "temporary" and "filesort".

    $Query = <<<OutSQL
SELECT s.`id`, s.`name`, DATE_FORMAT(d.`timestamp`, '$SQLFormat') AS date_res,
   AVG(d.`value`) AS `avg`, t.`name` AS timeslot
FROM `$TableData` AS d
INNER JOIN `$TableParent` AS s ON s.`id` = d.`parent_id`
INNER JOIN `$TableTime` AS t ON t.`id` = d.`timeslot_id`
WHERE d.`timestamp` >= %s AND d.`timestamp` <= %s $Where
GROUP BY d.`parent_id`, date_res, d.`timeslot_id`
ORDER BY s.`name` ASC, d.`timestamp` ASC
OutSQL;

The joins themselves are good, but it's the fetching from the primary table that causes issues.

Explain:

+----+-------------+-------+------+--------------------------------------------+--------------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys                              | key          | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+--------------------------------------------+--------------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | s     | ALL  | PRIMARY                                    | NULL         | NULL    | NULL |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | d     | ref  | timeslot_id,timestamp,id_time_slot,id_slot | id_time_slot | 4       | s.id |    1 | Using where                     |
|  1 | SIMPLE      | t     | ALL  | PRIMARY                                    | NULL         | NULL    | NULL |    2 | Using where; Using join buffer  |
+----+-------------+-------+------+--------------------------------------------+--------------+---------+------+------+---------------------------------+

Edited February 22, 2013 by Christian F.

Jessica · February 22, 2013

I've never done it but a coworker told me it was possible a few weeks ago.

http://dev.mysql.com...ndex-hints.html

Might help.

Edit: and according to the examples you can do it on the first table not just the joined ones.

Edited February 22, 2013 by Jessica

Barand · February 22, 2013

An index on the timestamp column should help.

Also, there shouldn't be any need to format the timestamp if you just want the standard yyyy-mm-dd format. Just select DATE(timestamp) to get the date portion.

jazzman1 · February 22, 2013

I think that Oracle's optimizing guide will help you a lot: http://docs.oracle.com/html/A95912_01/wn32tune.htm#i631457

Christian F. · February 23, 2013

Jessica: Heh... Thanks for the tip, but unfortunately I don't know nearly enough about MySQL indices and other internals to start to manually mess with them. At least not for now. It's bookmarked though, for future reading.

Barand: Thanks, that seems to have done something, but it didn't actually use the timestamp index..? Instead it chose the "parent_id, timeslot" index. Which, upon reading the EXPLAIN result makes sense:

+----+-------------+-------+------+----------------------------------------------------------+---------+---------+-----------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys                                            | key     | key_len | ref                         | rows | Extra                           |
+----+-------------+-------+------+----------------------------------------------------------+---------+---------+-----------------------------+------+---------------------------------+
|  1 | SIMPLE      | t     | ALL  | PRIMARY                                                  | NULL    | NULL    | NULL                        |    2 | Using temporary; Using filesort |
|  1 | SIMPLE      | s     | ALL  | PRIMARY                                                  | NULL    | NULL    | NULL                        |    5 | Using join buffer               |
|  1 | SIMPLE      | d     | ref  | timeslot_id,timestamp,id_time_slot,id_slot,idx_timestamp | id_slot | 5       | database.s.id,database.t.id |   49 | Using where                     |
+----+-------------+-------+------+----------------------------------------------------------+---------+---------+-----------------------------+------+---------------------------------+

"t" is the alias for the timeslot table (only 2 rows in it), and "s" the alias for the parent table. "d" is the data table, with all of the rows.

Upon having a second look: I have no idea why adding the idx_timestamp index made any difference. Seeing as I had an index on the timestamp already..?

Oh, well.. Guess I'll just have to check the performance once in a while, to see how it's holding up. At least it's using an index for the main data table.

The format of the timestamp is actually one out of four available, so the formatting is (unfortunately) necessary.
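
Since the format is one of four, one way to keep the string that ends up in DATE_FORMAT() under control is a fixed lookup, sketched below. The thread never shows what $SQLFormat actually contains, so the format strings here are guesses for illustration only, and format_for() is a hypothetical helper. %x and %v are MySQL's specifiers for the year and week number of Monday-based weeks.

```python
# Hypothetical mapping from the four intervals to MySQL DATE_FORMAT() strings.
# ASSUMPTION: these exact formats are illustrative, not taken from the thread.
SQL_FORMATS = {
    "month": "%Y-%m",
    "week": "%x-%v",           # year and week number, weeks starting on Monday
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H:00",
}

def format_for(interval: str) -> str:
    """Return the DATE_FORMAT() string for a known interval, else raise."""
    try:
        return SQL_FORMATS[interval.lower()]
    except KeyError:
        raise ValueError(f"unknown interval: {interval!r}") from None

print(format_for("Week"))
```

Only strings from the lookup ever reach the query text, which matters because the format is interpolated directly into the SQL.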

Jazzman: Thanks for trying to help, but unfortunately that guide was a bit too basic for my needs. Also, not sure how applicable it is for a MySQL database.

Sentiment is appreciated though.

Edited February 23, 2013 by Christian F.

PrecisionGW1 · February 23, 2013

I have my site, I am now ready to build the table and setup the database. Could really use a hand. Any advice?

C

trq · February 23, 2013

I have my site, I am now ready to build the table and setup the database. Could really use a hand. Any advice?

C

Yeah. Open your own thread with some more specific questions.
