Filling the gaps

Kieran Menor · March 3, 2011

I have a table with two columns: postal code and age (in years). I would like to have a query that lists the amount of rows for every age for every postal code. As in, something like this:

Postal CodeAgeCount

10007050

100071100

10007230

20007025

2000710

200072200

... and so forth.

At first, it seemed easy enough, but then I realized that if no rows with a certain age exists for a certain postal code, it would not show in the result set. In the example above, the row with postal code 2000 and age 71 would be missing. Such gaps can be quite annoying when you work with the data later on.

So, my question is, how do I fill the gaps?

Muddy_Funster · March 3, 2011

could you show your actual current query?

Kieran Menor · March 3, 2011

I don't think it has any relevance, but here it is:

SELECT `zipcode` , (
YEAR( NOW( ) ) - CAST( SUBSTRING( `birthday` , 1, 4 ) AS SIGNED )
) AS `age` , COUNT( * ) AS `count`
FROM `p_member`
GROUP BY `zipcode` , `age`
ORDER BY `zipcode` , `age`

It returns a result similar to the example in my first post, but as mentioned (and expected), it doesn't list rows where the count is 0.

As I said, what I want is a list of all ages for each postal code, and the amount of rows that match, even if it is 0.

kickstart · March 3, 2011

Hi

Think to do this you need a list of all the zip codes and all the ages. Which can be done by selecting distinct zip codes and ages from the existing table and doing a cross join between them

Something like this would work

SELECT a.zipcode, b.age, COUNT(c.id)
FROM (SELECT DISTINCT zipcode FROM p_member) a
CROSS JOIN (SELECT DISTINCT (YEAR( NOW( ) ) - CAST( SUBSTRING( `birthday` , 1, 4 ) AS SIGNED )) AS `age` FROM p_member) b
LEFT OUTER JOIN p_member c ON a.zipcode = c.zipcode AND b.age = (YEAR( NOW( ) ) - CAST( SUBSTRING( c.birthday , 1, 4 ) AS SIGNED ))
GROUP BY a.zipcode, b.age
ORDER BY a.zipcode, b.age

This is assuming you have a unique id field on the table p_member (change the field name in COUNT(c.id) to the actual unique field name).

All the best

Keith

Kieran Menor · March 3, 2011

Awesome, it seems to work. Thanks a lot.

Although, it takes about a minute to execute. This doesn't matter much now as I only need to pull out the data every once in a while. The table only has about 6000 rows now, though, so I imagine it would get much worse later. Any ideas for optimization would be much appreciated.

kickstart · March 3, 2011

Hi

First thing I would do is to extract the zip code into a separate table, so you have a table of unique zip codes with p_member just storing the id field of the zip code.

I would also put the date of birth into a date field so you can use datediff to get the age.

Possible to also just generate a list of years for the 2nd subselect rather than trying the extract the range of years (this would also remove the situation where you will get a gap at the moment if nobody on your list was born one year).

If you let me have your table definition I will have a play around.

All the best

Keith

Kieran Menor · March 3, 2011

I imagine that you don't need all the irrelevant data fields. I didn't design this table myself, and now that I look at it, "birthday" is actually a date field. So, okay, the table looks something like this:

CREATE TABLE `members` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `zipcode` varchar(55) NOT NULL DEFAULT '',
  `birthday` date NOT NULL DEFAULT '0000-00-00',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM ;

"zipcode" is always a 4 digit number, so you can definitely use another data type to increase efficiency. Also, I see that the table currently has no indices (other than PRIMARY).

kickstart · March 3, 2011

Hi

First set up a table of zip codes

CREATE TABLE IF NOT EXISTS `zipcodes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `zipcode` varchar(25) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=11 ;

Populate that with all the zip codes.

Add a column to the members table called zipcodeid. Populate this with the id from the zip code table for the members zip code. Put an index on this column.

Create a table of years.

CREATE TABLE IF NOT EXISTS `years` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `year` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `year` (`year`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=21 ;

Populate that with the years you care about.

Then have your SQL as

SELECT a.zipcode, YEAR(now()) - b.year, COUNT( c.id )
FROM zipcodes a
CROSS JOIN years b
LEFT OUTER JOIN members c ON a.id = c.zipcodeid
AND b.year = YEAR( c.birthday )
GROUP BY a.zipcode, YEAR(now()) - b.year
ORDER BY a.zipcode, YEAR(now()) - b.year

Note this is a bit of a cheat as it is basing age on just the year.

All the best

Keith

Sign In

Filling the gaps

Recommended Posts

Kieran Menor

Link to comment

Share on other sites

Muddy_Funster

Link to comment

Share on other sites

Kieran Menor

Link to comment

Share on other sites

kickstart

Link to comment

Share on other sites

Kieran Menor

Link to comment

Share on other sites

kickstart

Link to comment

Share on other sites

Kieran Menor

Link to comment

Share on other sites

kickstart

Link to comment

Share on other sites

Join the conversation

Browse

Activity

Important Information