Scooby08 Posted September 1, 2012 Share Posted September 1, 2012 MySQL 5.1.46 The Goal: On each row, check if there are any records having the same ssn within the previous 10 minutes using the "created_on" datetime field.. If a record *is* found within the previous 10 minutes, don't collect the current row and move onto the next row and do the same thing. If a record *is not* found, collect the current row and move onto the next row and do the same thing. SHOW CREATE TABLE posts_inbound | posts_inbound | CREATE TABLE `posts_inbound` ( `id` int(11) unsigned NOT NULL auto_increment, `publisher_id` int(11) unsigned default NULL, `transaction_id` varchar(32) default NULL, `api_key` varchar(64) default NULL, `score_version` char(10) default NULL, `score` int(11) NOT NULL default '0', `sold` char(1) default NULL, `price` decimal(5,2) NOT NULL default '0.00', `post_status` varchar(32) default NULL, `post_errors` text, `postback_status` varchar(32) default NULL, `postback_errors` text, `reference_id` varchar(64) default NULL, `affiliate_id` varchar(64) default NULL, `affiliate_sid` varchar(64) default NULL, `lead_id` varchar(64) default NULL, `buyer_id` varchar(64) default NULL, `buyer_sid` varchar(64) default NULL, `redirected` tinyint(1) NOT NULL default '0', `source_url` varchar(64) default NULL, `ip_address` varchar(16) default NULL, `user_agent` varchar(255) default NULL, `time_stamp` datetime NOT NULL default '0000-00-00 00:00:00', `is_mobile` tinyint(1) NOT NULL default '0', `loan_amount` decimal(6,2) NOT NULL default '0.00', `first_name` varchar(64) default NULL, `last_name` varchar(64) default NULL, `email` varchar(64) default NULL, `gender` char(1) default NULL, `date_of_birth` char(10) default NULL, `marital_status` char(1) default NULL, `drivers_license_number` varchar(30) default NULL, `drivers_license_state` char(2) default NULL, `social_security_number` char(9) default NULL, `is_military` char(1) default NULL, `best_contact_time` char(1) default NULL, `mother_maiden_name` varchar(64) default NULL, `home_phone` varchar(32) default NULL, `cell_phone` varchar(32) default NULL, `address` varchar(40) default NULL, `city` varchar(32) default NULL, `state` char(2) default NULL, `zip` char(5) default NULL, `rent_or_own` tinyint(1) default NULL, `monthly_rent_or_mortgage` decimal(6,2) NOT NULL default '0.00', `time_at_current_address` tinyint(2) default NULL, `previous_address` varchar(40) default NULL, `previous_city` varchar(32) default NULL, `previous_state` char(2) default NULL, `previous_zip` char(5) default NULL, `employer_name` varchar(64) default NULL, `employer_address` varchar(40) default NULL, `employer_city` varchar(32) default NULL, `employer_state` char(2) default NULL, `employer_zip` char(5) default NULL, `work_phone` varchar(32) default NULL, `time_at_job` tinyint(2) default NULL, `monthly_income` decimal(7,2) NOT NULL default '0.00', `income_source` tinyint(1) default NULL, `pay_frequency` tinyint(1) default NULL, `first_pay_day` char(10) default NULL, `second_pay_day` char(10) default NULL, `last_pay_amount` decimal(7,2) NOT NULL default '0.00', `is_direct_deposit` tinyint(1) default NULL, `bank_name` varchar(64) default NULL, `bank_phone` varchar(32) default NULL, `bank_account_type` char(1) default NULL, `bank_account_length` tinyint(2) default NULL, `bank_routing_number` char(9) default NULL, `bank_account_number` varchar(17) default NULL, `reference_first_name` varchar(64) default NULL, `reference_last_name` varchar(64) default NULL, `reference_phone` varchar(32) default NULL, `reference_relation` varchar(20) default NULL, `time_start` double default NULL, `time_end` double default NULL, `time` double default NULL, `created_date` date NOT NULL default '0000-00-00', `created_on` datetime NOT NULL default '0000-00-00 00:00:00', `updated_on` datetime NOT NULL default '0000-00-00 00:00:00', PRIMARY KEY (`id`), KEY `publisher_id` (`publisher_id`), KEY `transaction_id` (`transaction_id`), KEY `api_key` (`api_key`), KEY `created_date` (`created_date`), KEY `social_security_number` (`social_security_number`), KEY `idx_bank_account_type` (`bank_account_type`), KEY `idx_email` (`email`), KEY `ssn_id` (`social_security_number`,`id`) ) ENGINE=MyISAM AUTO_INCREMENT=7874256 DEFAULT CHARSET=latin1 | Here's sort of a pseudo query of what I'm trying to collect: SELECT * FROM posts_inbound WHERE (social_security_number NOT BETWEEN (created_on - INTERVAL 10 MINUTE) AND created_on) Here's a query that works, but is very very slow: SELECT social_security_number AS ssn, created_on AS timeframe_start, (SELECT COUNT(id) FROM posts_inbound WHERE social_security_number=ssn AND (created_on >= (timeframe_start - INTERVAL 10 MINUTE) AND created_on < timeframe_start)) AS dup_count FROM posts_inbound WHERE (created_date BETWEEN '2012-08-09' AND '2012-08-10') GROUP BY social_security_number, id HAVING dup_count = 0 The explain for the above query: mysql> explain SELECT social_security_number AS ssn, created_on AS timeframe_start, (SELECT COUNT(id) FROM posts_inbound WHERE social_security_number=ssn AND (created_on >= (timeframe_start - INTERVAL 10 MINUTE) AND created_on < timeframe_start)) AS dup_count FROM posts_inbound WHERE (created_date BETWEEN '2012-08-09' AND '2012-08-10') GROUP BY social_security_number, id HAVING dup_count = 0; +----+--------------------+---------------+-------+-------------------------------+------------------------+---------+------+--------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+---------------+-------+-------------------------------+------------------------+---------+------+--------+-----------------------------+ | 1 | PRIMARY | posts_inbound | range | created_date | created_date | 3 | NULL | 198107 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | posts_inbound | ref | social_security_number,ssn_id | social_security_number | 10 | func | 9 | Using where | +----+--------------------+---------------+-------+-------------------------------+------------------------+---------+------+--------+-----------------------------+ I'm game to either write an entirely new query or optimize the one above that works.. Thanks to all who reply! Quote Link to comment Share on other sites More sharing options...
Barand Posted September 1, 2012 Share Posted September 1, 2012 Something like this, using a left join to check for records within 10 mins SELECT a.* FROM posts_inbound a LEFT JOIN posts_inbound b ON a.social_security_number = b social_security_number AND b.created_on > a.created_on - INTERVAL 10 MINUTE WHERE b.social_security_number IS NULL Quote Link to comment Share on other sites More sharing options...
Barand Posted September 1, 2012 Share Posted September 1, 2012 Correction to above: SELECT a.* FROM posts_inbound a LEFT JOIN posts_inbound b ON a.social_security_number = b social_security_number AND b.created < a.created_on AND b.created_on > a.created_on - INTERVAL 10 MINUTE WHERE b.social_security_number IS NULL Quote Link to comment Share on other sites More sharing options...
Scooby08 Posted September 1, 2012 Author Share Posted September 1, 2012 Thanks for replying Barand! The query you provided is keeping the Joined records.. We do not want to keep those, just want to keep posts_inbound "a" if there are no records with the same ssn within the 10 minute timeframe, otherwise move on and keep nothing from that record.. Want to end up with 1 record per ssn per 10 mintues from that record.. I'm having troubles trying to explain my objective clearly.. Maybe this might help: if the current rows social_security_number does not exist between now and 10 minutes ago (no prior posts within 10 mins) keep the current row in the result else do not keep the row and move on to the next row and repeat Trying to filter out 10 minute uniques.. Quote Link to comment Share on other sites More sharing options...
Barand Posted September 1, 2012 Share Posted September 1, 2012 Once I corrected a couple of typos it worked for me SELECT a.* FROM posts_inbound a LEFT JOIN posts_inbound b ON a.social_security_number = b.social_security_number AND b.created_on < a.created_on AND b.created_on > a.created_on - INTERVAL 10 MINUTE WHERE b.social_security_number IS NULL; Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.