Jump to content

Filling the gaps


Kieran Menor

Recommended Posts

I have a table with two columns: postal code and age (in years). I would like to have a query that lists the amount of rows for every age for every postal code. As in, something like this:

 

Postal CodeAgeCount

10007050

100071100

10007230

20007025

2000710

200072200

... and so forth.

 

At first, it seemed easy enough, but then I realized that if no rows with a certain age exists for a certain postal code, it would not show in the result set. In the example above, the row with postal code 2000 and age 71 would be missing. Such gaps can be quite annoying when you work with the data later on.

 

So, my question is, how do I fill the gaps?

Link to comment
Share on other sites

I don't think it has any relevance, but here it is:

 

SELECT `zipcode` , (
YEAR( NOW( ) ) - CAST( SUBSTRING( `birthday` , 1, 4 ) AS SIGNED )
) AS `age` , COUNT( * ) AS `count`
FROM `p_member`
GROUP BY `zipcode` , `age`
ORDER BY `zipcode` , `age`

 

It returns a result similar to the example in my first post, but as mentioned (and expected), it doesn't list rows where the count is 0.

 

As I said, what I want is a list of all ages for each postal code, and the amount of rows that match, even if it is 0.

Link to comment
Share on other sites

Hi

 

Think to do this you need a list of all the zip codes and all the ages. Which can be done by selecting distinct zip codes and ages from the existing table and doing a cross join between them

 

Something like this would work

 

SELECT a.zipcode, b.age, COUNT(c.id)
FROM (SELECT DISTINCT zipcode FROM p_member) a
CROSS JOIN (SELECT DISTINCT (YEAR( NOW( ) ) - CAST( SUBSTRING( `birthday` , 1, 4 ) AS SIGNED )) AS `age` FROM p_member) b
LEFT OUTER JOIN p_member c ON a.zipcode = c.zipcode AND b.age = (YEAR( NOW( ) ) - CAST( SUBSTRING( c.birthday , 1, 4 ) AS SIGNED ))
GROUP BY a.zipcode, b.age
ORDER BY a.zipcode, b.age

 

This is assuming you have a unique id field on the table p_member (change the field name in COUNT(c.id) to the actual unique field name).

 

All the best

 

Keith

Link to comment
Share on other sites

Awesome, it seems to work. Thanks a lot.

 

Although, it takes about a minute to execute. This doesn't matter much now as I only need to pull out the data every once in a while. The table only has about 6000 rows now, though, so I imagine it would get much worse later. Any ideas for optimization would be much appreciated.

Link to comment
Share on other sites

Hi

 

First thing I would do is to extract the zip code into a separate table, so you have a table of unique zip codes with p_member just storing the id field of the zip code.

 

I would also put the date of birth into a date field so you can use datediff to get the age.

 

Possible to also just generate a list of years for the 2nd subselect rather than trying the extract the range of years (this would also remove the situation where you will get a gap at the moment if nobody on your list was born one year).

 

If you let me have your table definition I will have a play around.

 

All the best

 

Keith

Link to comment
Share on other sites

I imagine that you don't need all the irrelevant data fields. I didn't design this table myself, and now that I look at it, "birthday" is actually a date field. So, okay, the table looks something like this:

 

CREATE TABLE `members` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `zipcode` varchar(55) NOT NULL DEFAULT '',
  `birthday` date NOT NULL DEFAULT '0000-00-00',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM ;

 

"zipcode" is always a 4 digit number, so you can definitely use another data type to increase efficiency. Also, I see that the table currently has no indices (other than PRIMARY).

Link to comment
Share on other sites

Hi

 

First set up a table of zip codes

 

CREATE TABLE IF NOT EXISTS `zipcodes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `zipcode` varchar(25) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=11 ;

 

Populate that with all the zip codes.

 

Add a column to the members table called zipcodeid. Populate this with the id from the zip code table for the members zip code. Put an index on this column.

 

Create a table of years.

 

CREATE TABLE IF NOT EXISTS `years` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `year` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `year` (`year`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=21 ;

 

Populate that with the years you care about.

 

Then have your SQL as

 

SELECT a.zipcode, YEAR(now()) - b.year, COUNT( c.id )
FROM zipcodes a
CROSS JOIN years b
LEFT OUTER JOIN members c ON a.id = c.zipcodeid
AND b.year = YEAR( c.birthday )
GROUP BY a.zipcode, YEAR(now()) - b.year
ORDER BY a.zipcode, YEAR(now()) - b.year

 

Note this is a bit of a cheat as it is basing age on just the year.

 

All the best

 

Keith

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.