Jump to content

[SOLVED] Counting records from joined tables without duplicates


MegamanNL

Recommended Posts

Hi guys!

This is my first post on PHP Freaks, please be gentle.. ;)

 

I have a similar situation in a few applications so I've made sample tables and query to explain the issue I'm facing.

 

Using MySQL 5.0.51a-3ubuntu5.4

 

Tables:

CREATE TABLE `issues` (
 `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
 `name` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
 `status` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
 PRIMARY KEY (`id`),
 UNIQUE KEY `id` (`id`)
)ENGINE=MyISAM
AUTO_INCREMENT=6 CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';
COMMIT;
/* Data for the `issues` table  (Records 1 - 5) */
INSERT INTO `issues` (`id`, `name`, `status`) VALUES 
 (1, 'IE Crash', 'closed'),
 (2, 'Mail problem', 'open'),
 (3, 'Dead laptop', 'closed'),
 (4, 'Blue screens', 'closed'),
 (5, 'Driver update', 'open');

CREATE TABLE `issues_log` (
 `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
 `issue_id` INTEGER(11) UNSIGNED NOT NULL,
 `status_date` DATE NOT NULL,
 `status` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
 PRIMARY KEY (`id`),
 UNIQUE KEY `id` (`id`)
)ENGINE=MyISAM
AUTO_INCREMENT=15 CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';
COMMIT;
/* Data for the `issues_log` table  (Records 1 - 12) */
INSERT INTO `issues_log` (`id`, `issue_id`, `status_date`, `status`) VALUES 
 (1, 1, '2009-02-09', 'open'),
 (2, 1, '2009-02-11', 'closed'),
 (3, 2, '2009-02-09', 'open'),
 (8, 4, '2009-02-09', 'open'),
 (6, 4, '2009-02-10', 'open'),
 (7, 4, '2009-02-10', 'closed'),
 (9, 4, '2009-02-09', 'closed'),
 (10, 3, '2009-02-08', 'open'),
 (11, 3, '2009-02-11', 'closed'),
 (12, 5, '2009-02-09', 'open'),
 (13, 5, '2009-02-10', 'closed'),
 (14, 5, '2009-02-11', 'open');

 

Now, what I'm trying to retrieve are statistics for the closed issues. So I'd like to know day-by-day how many issues are closed but without counting reopened issues.

 

Query:

SELECT
  il.status_date AS sdt, COUNT(i.id) AS cnt
FROM
  issues AS i
  LEFT JOIN issues_log AS il ON il.issue_id = i.id
WHERE
  i.status = 'closed' AND il.status='closed'
GROUP BY sdt
ORDER BY sdt ASC

 

Explain output:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	il	ALL					9	Using where; Using temporary; Using filesort
1	SIMPLE	i	ALL	PRIMARY,id				3	Using where

 

Query output:

sdt	cnt
09-02-2009	1
10-02-2009	1
11-02-2009	2

 

Desired output:

sdt	cnt
10-02-2009	1
11-02-2009	2

 

The record with id 9 for 09-02-2009 (dd-mm-yyy) should be omitted.

I've tried selecting from just the issues_log table, but then I'm stuck when trying to sort the records so that I get only the newest one. Grouping/distinct/etc. will count record 9, in stead of 7 which is the most recent one.

 

Is there any method other then looping that would give the desired result?

Link to comment
Share on other sites

A few things:

 

1) You don't need to have two duplicate indexes for "id" -- PRIMARY KEY will suffice, drop the other one.

 

2) You should really index the fields you're using for JOIN.

 

3) because it's a left join, you need to make any conditions from the where clause of the left join-ed table to the ON clause.

Link to comment
Share on other sites

A few things:

1) You don't need to have two duplicate indexes for "id" -- PRIMARY KEY will suffice, drop the other one.

 

Hi fenway, thanks for your reply.

EMS MySQL manager always creates those double indexes automatically, not sure why. I usually remove them from production DBs.

 

2) You should really index the fields you're using for JOIN.

 

These were just sample tables to simplify the issue I'm trying to resolve. The real tables have a lot more fields and are indexed properly. But thanks for the reminder.

 

3) because it's a left join, you need to make any conditions from the where clause of the left join-ed table to the ON clause.

 

So you mean this would be better?

SELECT
il.status_date AS sdt, COUNT(i.id) AS cnt
FROM
issues AS i
LEFT JOIN issues_log AS il ON i.id = il.issue_id
WHERE
i.status = 'closed' AND il.status='closed'
GROUP BY sdt
ORDER BY sdt ASC

 

I still can't figure out how I can make the counting work the way I'd like. Or am I misunderstanding you completely?

 

Link to comment
Share on other sites

Well, unless I'm missing something, your query works exactly like the first one I posted. When I run it I get:

sdt   cnt
09-02-2009   1
10-02-2009   1
11-02-2009   2

 

What I'd like to see is the result without the duplicate row for issue_id 4. In other words just 10-02 and 11-02.

I might be able to group by issue_id, but then I won't have the most recent date... :(

Link to comment
Share on other sites

I'm sorry... I'm quite lost now. Is there a counting issue or a grouping issue?

If only I knew... I think a bit of both.

 

And why should certain rows be excluded?

When I graph the number of closed issues by date, the total of my query exceeds the total of all closed issues.

In the resultset I get 1+1+2 = 4 closed issues... But there are only 3 issues with closed status in my `issues` table. That's caused by the issue with id 4. That one has been opened, closed and then opened and closed again one day later. I only want to count the records from `issues_log` with the most recent date for each closed issue.

 

Your sample data doesn't match your output, either.

Ahum? It should, I actually created the sample table and copied everything from there...  :D

 

CREATE TABLE `issues` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `status` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `status` (`status`)
)ENGINE=MyISAM
AUTO_INCREMENT=6 CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

CREATE TABLE `issues_log` (
  `id` INTEGER(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `issue_id` INTEGER(11) UNSIGNED NOT NULL,
  `status_date` DATE NOT NULL,
  `status` VARCHAR(255) COLLATE utf8_general_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `issue_id` (`issue_id`),
  KEY `status` (`status`)
)ENGINE=MyISAM
AUTO_INCREMENT=15 CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

INSERT INTO `issues` (`id`, `name`, `status`) VALUES 
  (1,'IE Crash','closed'),
  (2,'Mail problem','open'),
  (3,'Dead laptop','closed'),
(4,'Blue screens','closed'),
  (5,'Driver update','open');
COMMIT;

INSERT INTO `issues_log` (`id`, `issue_id`, `status_date`, `status`) VALUES 
  (1,1,'2009-02-09','open'),
  (2,1,'2009-02-11','closed'),
  (3,2,'2009-02-09','open'),
  (6,4,'2009-02-10','open'),
(7,4,'2009-02-10','closed'),
  (8,4,'2009-02-09','open'),
(9,4,'2009-02-09','closed'),
  (10,3,'2009-02-08','open'),
  (11,3,'2009-02-11','closed'),
  (12,5,'2009-02-09','open'),
  (13,5,'2009-02-10','closed'),
  (14,5,'2009-02-11','open');
COMMIT;

 

SELECT
   il.status_date AS sdt, COUNT(i.id) AS cnt
FROM
   issues AS i
   LEFT JOIN issues_log AS il ON (i.id = il.issue_id AND il.status = 'closed')
WHERE
   i.status = 'closed'
GROUP BY sdt
ORDER BY sdt ASC;

 

Your query returns:

 

sdt	cnt
09-02-2009	1
10-02-2009	1
11-02-2009	2

 

The query counts record ids 2, 7, 9 and 11 from `issues_log`. And I want it to count only 2, 7 and 11.

Because 9 is a duplicate issue on a different date (records 7 and 9 from `issues_log` both belong to the issue with id 4).

 

I hope my description is clearer now.

 

 

Link to comment
Share on other sites

Oh, I see now -- we need to use the most recent closed status.

That would be a correct assessment.

 

BTW, why do you have two status flags?

One of the tables in production has many thousands of records and select queries running continuously. It seems faster to me to add a small field compared to using a join. I'd rather not use join if I don't have to. But what do I know...

Link to comment
Share on other sites

Well, this will retrieve the "desired" rows, correct?

 

SELECT il.*
FROM issues_log AS il
INNER JOIN
(
SELECT issue_id, MAX(status_date) AS status_date
FROM issues_log
WHERE status = 'closed'
GROUP BY issue_id
) AS sub USING( issue_id, status_date )

 

Shouldn't be too hard to count and join it back to issues table.

Link to comment
Share on other sites

Working query for one of the applications (field names are different but similar table structure):

 

SELECT COUNT(LinkId) AS InstallCount, DATE(InDateTime) AS InstallDate
FROM ots_report_status AS ors
INNER JOIN
(
SELECT LinkId, MAX(InDateTime) AS InDateTime
FROM ots_report_status
WHERE LinkTable = 'ots_device' 
GROUP BY LinkId
) AS sub USING( LinkId, InDateTime ) WHERE ors.To_Ots_System_Status_Id = 1 GROUP BY DATE(InDateTime)

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.