Jump to content

Table design: Sum+grouping values by date and time of day. (500k+ rows)

Christian F.

Recommended Posts

A little background on the project I'm working on:


I have a reasonably large number of values, over 500k rows worth. The values are generated per hour, per parent. I want to group these values by one of four time intervals: Month, week, day or hour, and get the average for this period.

The challenge comes by the fact that I need to split the values into two different blocks, based upon whether the value was generated during the "day" or "night/weekend".


So far I've come up with two different approaches to this:

  1. Have two fields in the table for saving the values, one for the "daytime" values and the other for "night time/weekend" values.
    This would allow me to run separate functions on the fields, without involving the date functions in MySQL. At the cost of having two fields for what is essentially the same type of data, and having to do the splitting in the pre-insert phase.
  2. Save all values in the same field, and then use MySQL's datetime functions with a CASE-WHEN to figure out in which block they belong.
    This approach seems to be the cleanest one in terms of database-design, but I fear it will put a lot of strain on the database. Especially since the SELECT statements will run more often than the insertion script.


So the question is if I should go for the first approach, the second, or if there is some other solution to this that I've failed to grasp? I could really do with some expert help on this one. :)


Example values:


timestamp    parent_1    parent_2    parent_3
2013.02.04 04:00:00    6,300000    1,400000    4,000000
2013.02.04 05:00:00    6,300000    1,400000    4,000000
2013.02.04 06:00:00    6,300000    1,400000    4,000000
2013.02.04 07:00:00    6,300000    1,400000    4,000000
2013.02.04 08:00:00    6,300000    1,400000    4,000000
2013.02.04 09:00:00    6,300000    1,400000    4,000000
2013.02.04 10:00:00    13,600000    9,900000    10,800000
2013.02.04 11:00:00    13,600000    9,900000    10,800000



Desired output:


Date:        Day avg. Night avg.
-- Parent_1
2013.02.04     99,5    6,3

-- Parent_2
2013.02.04     5,65    1,4

-- Parent_3
2013.02.04      4,9    4,0


Edited by Christian F.
Link to comment
Share on other sites

Thank you, Barand, it was just what I needed (I think). :)


I've come to the conclusion that here is indeed a third option, which I missed: Creating a new table, called "data_timeslot", and referencing it in the data table. Then group the data based upon this field, to separate the data properly.

Benefits of this, as I see it, is that the integrity of the data is not dependent upon the table structure, I don't have to use datetime calculations to sort each individual row, and it's possible to change/add the timezones without altering the structure of the tables themselves.


Thank you again, Barand.

Oh, and if I'm still doing it wrong, please feel free to shout at me. :P


Table definitons:


CREATE TABLE `data_timeslot` (
 `name` 		VARCHAR(30) 	NOT NULL,
) ENGINE=InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;

 `parent_id` 	INT 			UNSIGNED 		NOT NULL,
 `timeslot_id` TINYINT 		UNSIGNED 		NOT NULL,
 `timestamp` 	TIMESTAMP 		NOT NULL,
 `value` 	DECIMAL(9,6) 	NOT NULL,
 FOREIGN KEY(`parent_id`) 	REFERENCES `parent`(`id`)
 FOREIGN KEY(`timeslot_id` ) 	REFERENCES `data_timeslot`(`id`)
) ENGINE=InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;


Yes, I know you don't like backticks in the SQL code. :P


Edited by Christian F.
Link to comment
Share on other sites

It's not as much a problem with creating the foreign key, as it's the entire "parent" table missing from the code I posted above. I didn't include it as it was trivial, and not relevant to the problem at hand. Should be easy to recreate it.


While we're (still) on this subject, I was hoping if someone could perhaps tell me if there's a way to make this query use indices? Right now no matter what indices I add, and in what combination, it still insist upon using "temporary" and "filesort".

    $Query = <<<OutSQL
SELECT s.`id`, s.`name`, DATE_FORMAT(d.`timestamp`, '$SQLFormat') AS date_res,
   AVG(d.`percent`) AS `avg`, t.`name` AS timeslot
FROM `$TableData` AS d
INNER JOIN `$TableParent` AS s ON s.`id` = d.`station_id`
INNER JOIN `$TableTime` AS t ON t.`id` = d.`timeslot_id`
WHERE d.`timestamp` >= %s AND d.`timestamp` <= %s $Where
GROUP BY d.`parent_id`, date_res, d.`timeslot_id`
ORDER BY s.`name` ASC,d.`timestamp` ASC


The joins themselves are good, but it's the fetching from the primary table that causes issues.



| id | select_type | table | type | possible_keys                | key       | key_len | ref               | rows | Extra                           |
|  1 | SIMPLE      | s     | ALL  | PRIMARY                    | NULL         | NULL    | NULL              |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | d     | ref  | timeslot_id,timestamp,id_time_slot,id_slot   | id_time_slot | 4       |       s.id |    1 | Using where                     |
|  1 | SIMPLE      | t     | ALL  | PRIMARY                    | NULL         | NULL    | NULL              |    2 | Using where; Using join buffer  |


Edited by Christian F.
Link to comment
Share on other sites

Jessica: Heh... Thanks for the tip, but unfortunately I don't know nearly enough about MySQL indices and other internals to start to manually mess with them. At least not for now. It's bookmarked though, for future reading. :)


Barand: Thanks, that seems to have done something, but it didn't actually use the timestamp index..? Instead it chose the "parent_id, timeslot" index. Which, upon reading the EXPLAIN result makes sense:


| id | select_type | table | type | possible_keys                    | key     | key_len | ref                 | rows | Extra                           |
|  1 | SIMPLE      | t     | ALL  | PRIMARY                          | NULL   | NULL  | NULL                  |    2 | Using temporary; Using filesort |
|  1 | SIMPLE      | s     | ALL  | PRIMARY                          | NULL   | NULL  | NULL                  |    5 | Using join buffer               |
|  1 | SIMPLE      | d     | ref  | timeslot_id,timestamp,id_time_slot,id_slot,idx_timestamp | id_slot | 5   |      database.s.id,database.t.id   |   49 | Using where                     |

"t" is the alias for the timeslot table (only 2 rows in it), and "s" the alias for the parent table. "d" is the data table, with all of the rows.


Upon having a second look: I have no idea why adding the idx_timestamp index made any difference. Seeing as I had an index on the timestamp already..?

Oh, well.. Guess I'll just have to check the performance once in a while, to see how it's holding up. At least it's using an index for the main data table.


The format of the timestamp is actually one out of four available, so the formatting is (unfortunately) necessary.


Jazzman: Thanks for trying to help, but unfortunately that guide was a bit too basic for my needs. Also, not sure how applicable it is for a MySQL database.

Sentiment is appreciated though. :)

Edited by Christian F.
Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.