Jump to content

Help with very large table and indexes


KurkPeterman

Recommended Posts

Hi,

 

I'm currently using MySQL client version: 5.1.37

 

I have a table with millions of employee records that get created on a daily basis.

 

CREATE TABLE employee ( 
group TINYINT NOT NULL , 
id INT NOT NULL , 
type TINYINT NOT NULL , 
subtype TINYINT NOT NULL , 
category TINYINT NOT NULL , 
dt DATE NOT NULL , 
revenue FLOAT NULL DEFAULT NULL 
) ENGINE = MYISAM ; 
ALTER TABLE employee ADD UNIQUE my_unique_key ( group , id , type , subtype , category , dt ); 

 

Each day a unique record is created for each of the following ranges:

 

group : 0 to 17 (with varying #s of ids per group)

id : ~250 total group:id combinations

type : 0 to 4

subtype : 0 to 2

category : 0 to 13

 

e.g.

INSERT INTO employee ( group, id, type, subtype, category, dt, revenue ) 
VALUES ( 0, 123, 1, 0, 5, "2010-12-05", 5500.5); 

 

(group/id combinations) * (type * subtype * category) = records each day

250 * ( 5 * 3 * 13 ) = ~50,000 records each day

 

I need to be able to efficiently calculate the SUM of revenue for a given (group, id, type, subtype, category) combination and a set date range. I don't currently have the correct indexes or approach and am looking for help in improving my efficiency. An example query is:

 

SELECT id, category, SUM(revenue) 
FROM employee 
WHERE group = 0 
AND id IN(1,2,3) 
AND type = 0 
AND subtype = 0 
AND dt >= "2010-12-03" 
AND dt <= "2010-12-05" 
GROUP BY id, category 

 

The range for 'dt' may be very large (up to about 9 months max). My questions are:

 

i. What indexes would be most efficient for the query above?

ii. Since I calculate the total each day, would it be more efficient to look up the most recent total and add that to today's value (instead of summing the values for the whole range each time).

iii. If the approach mentioned above is taken (of looking up most recent total and adding to daily record), how would I handle a situation where only a daily record exists (i.e. when it is the first day for that employee and no existing total record exists to add).

 

I've tried to provide as much information as possible, please let me know if there is anything I left out that is pertinant. Any assistance is greatly appreciated. Thanks!

Link to comment
Share on other sites

i. Create indexes in hierarchies. If you know you'll search a specific set of fields then you can create an index over all of them together.

 

Take group and id for example. Sounds like you'll rarely ever be searching on just the id - always group or group+id. So have an index across both:

INDEX (group, id)

That will include searches on just group, as well as group+id. But not id by itself.

 

type, subtype, and category may be similar: you'll probably search on just type, type+subtype, or type+subtype+category.

INDEX (type, subtype, category)

 

You'll likely want an index on the date separately.

INDEX (dt)

 

Don't think you'll want an index on the revenue.

 

That will create a relatively large index table, but if you're dealing with tens of thousands of rows per day then I don't think disk space is an issue.  The upside is that it's faster than using indexes on each field individually.

 

ii. Naturally. Having to recalculate a billion rows is worse than recalculating a hundred rows.

 

iii. Depends how you do the totaling each day. Where are the totals kept? How are they updated?

Link to comment
Share on other sites

Thanks so much for the reply.  I'm going to try creating indexes based on your recommendations.  In regards to the second and third question, the total values are stored as follows:

 

type : 0 = daily record (1-4 are various date ranges)

subtype : 0 = total value

 

So as an example, let's say I have the following records:

 

INSERT INTO employee ( group, id, type, subtype, category, dt, revenue ) 
VALUES ( 0, 1, 0, 0, 0, "2010-12-03", 100); 
VALUES ( 0, 1, 0, 0, 0, "2010-12-04", 200); 
VALUES ( 0, 1, 0, 0, 0, "2010-12-05", 150); 

 

So, for example let's say type:1 is 2010-12-03 through current date.  My table would have the following totals entries:

 

INSERT INTO employee ( group, id, type, subtype, category, dt, revenue ) 
VALUES ( 0, 1, 1, 0, 0, "2010-12-03", 100); 
VALUES ( 0, 1, 1, 0, 0, "2010-12-04", 300); 
VALUES ( 0, 1, 1, 0, 0, "2010-12-05", 450); 

 

I plan on using an INSERT INTO ( my totals query) ON DUPLICATE KEY UPDATE

 

For each day, I have a list of id's for each group such that my WHERE clause will always say:

 

WHERE group = #
AND id IN(#,#,#...)

 

I would like to be able to select all of the employee records for today and add them to the most recent total record if available.  I can get the select statements to get me both datasets:

 

Records for today:

 

SELECT *
FROM employee 
WHERE group = 0 
AND id IN(1,2,3) 
AND type = 0 
AND subtype = 0 
AND dt = "2010-12-05" 
GROUP BY id, category

 

 

Most Recent Totals:

 

SELECT *
FROM employee 
WHERE group = 0 
AND id IN(1,2,3) 
AND subtype = 0 
GROUP BY id, type, category
ORDER BY dt ASC

 

My problem is that when I get both groups, there are some new employees (or the first time when there is no totals record for an employee) and I end up not being able to connect the data.  For example:

 

In my previous examples, I have employee group 0:id 0 who has a daily record for "2010-12-03" that matches the first query, however, since there is no totals record yet, that group:id combination doesn't show up in the second query.  Is there a way to just add the values from the first query to the second if the record exists? (I'm pretty sure I can handle getting the record inserted properly so long as I can get the data selected for all group:id combinations on a particular day).

 

Link to comment
Share on other sites

I think I finally found it out. Here would be an example of a query to get the most recent daily record (type=0) and the most recent totals record (type=1) for the date range "2010-12-04" through "2010-12-07" for group 0, pid list (1,2,3) [the date being "2010-12-07" in this example]:

 

SELECT day_record.*, total_record.* 
FROM employee day_record 
LEFT JOIN (
   SELECT e_tot.* FROM employee e_tot 
   INNER JOIN (
      SELECT group, id, type, subtype, category, MAX(dt) as max_dt
      FROM employee
      WHERE group = 0
      AND id IN(1,2,3)
      AND type = 1
      AND subtype = 0
      AND dt < "2010-12-07"
      AND dt >= "2010-12-05"
      GROUP BY id, category ) e_max
   ON e_tot.group = e_max.group
   AND e_tot.id = e_max.id
   AND e_tot.type = e_max.type
   AND e_tot.subtype = e_max.subtype
   AND e_tot.category = e_max.category
   AND e_tot.dt = e_max.max_dt
) total_record
ON day_record.group = total_record.group
AND day_record.id = total_record.id
AND total_record.type = 1
AND day_record.subtype = total_record.subtype
AND day_record.category = total_record.category
WHERE day_record.group = 0
AND day_record.id IN(1,2,3)
AND day_record.type = 0
AND day_record.subtype = 0
AND day_record.dt = "2010-12-07"
GROUP BY day_record.pid, day_record.category

 

The first INNER JOIN selects all totals records that exist in my range prior to today.  The LEFT JOIN joins that totals record to all the daily records.  This should result in a NULL entry to total_record when there is no available total (but should still give me the daily record).

 

I think this will work.  If someone with better SQL skills than I would doublecheck, I'd greatly appreciate it.  Thanks!

Link to comment
Share on other sites

There is a typo where I say pid instead of id in the query and I am unable to edit it.  Sorry for the repost, but here is the correct query:

 

SELECT day_record.*, total_record.* 
FROM employee day_record 
LEFT JOIN (
   SELECT e_tot.* FROM employee e_tot 
   INNER JOIN (
      SELECT group, id, type, subtype, category, MAX(dt) as max_dt
      FROM employee
      WHERE group = 0
      AND id IN(1,2,3)
      AND type = 1
      AND subtype = 0
      AND dt < "2010-12-07"
      AND dt >= "2010-12-05"
      GROUP BY id, category ) e_max
   ON e_tot.group = e_max.group
   AND e_tot.id = e_max.id
   AND e_tot.type = e_max.type
   AND e_tot.subtype = e_max.subtype
   AND e_tot.category = e_max.category
   AND e_tot.dt = e_max.max_dt
) total_record
ON day_record.group = total_record.group
AND day_record.id = total_record.id
AND total_record.type = 1
AND day_record.subtype = total_record.subtype
AND day_record.category = total_record.category
WHERE day_record.group = 0
AND day_record.id IN(1,2,3)
AND day_record.type = 0
AND day_record.subtype = 0
AND day_record.dt = "2010-12-07"
GROUP BY day_record.id, day_record.category

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.