We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl
drop column last_login_date
select *
from mytbl
order by employee_id, effective_date
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system and take the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
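One set-based approach that still preserves Cheryl's back-and-forth history is the classic gaps-and-islands trick: number the rows per employee, number them again per employee plus the surviving columns, and the difference between the two identifies each run of consecutive identical records, which can then be collapsed with MIN/MAX. A rough sketch, assuming the column names from the example table and that consecutive records for an employee are contiguous in time:

;WITH grp AS (
    SELECT employee_id, employee_name, employee_dept,
           effective_date, expiration_date,
           ROW_NUMBER() OVER (PARTITION BY employee_id
                              ORDER BY effective_date)
         - ROW_NUMBER() OVER (PARTITION BY employee_id, employee_name, employee_dept
                              ORDER BY effective_date) AS island   -- constant within each run of identical values
    FROM mytbl
)
SELECT employee_id, employee_name, employee_dept,
       MIN(effective_date)  AS effective_date,
       MAX(expiration_date) AS expiration_date
FROM grp
GROUP BY employee_id, employee_name, employee_dept, island
ORDER BY employee_id, MIN(effective_date)

That result can either be materialized into a new table and swapped in, or turned into a DELETE of the redundant rows plus an UPDATE of the surviving rows' expiration_date, whichever is cheaper for tables this size.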
Are there any useful SQL queries that might be used to identify lists of potential duplicate records in a table?
For example, I have a Client database that includes a table dbo.Clients. This table contains various columns which could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode, etc. The data contained in these columns is not always exactly the same due to differences caused by user data entry, so some records may have missing data in some of the columns, and there could be spelling differences too. Like the following examples:
1 | Smith | John Raymond | NULL       | NI990946B     | SW12 8TQ
2 | Smith | John         | 06/03/1967 | NULL          | SW12 8TQ
3 | Smith | Jon Raymond  | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that whilst it is easy for a human being to review these 3 entries and conclude that they are most likely the same client entered into the database 3 times, I cannot find a reliable way of identifying them using a SQL query.
I've considered using some sort of concatenation to a new column, minus white space and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy Logic maybe?
The results would produce a grid something like this for the example above:
ID | Surname | Forenames    | DuplicateID | DupSurname | DupForenames
1  | Smith   | John Raymond | 2           | Smith      | John
1  | Smith   | John Raymond | 3           | Smith      | Jon Raymond
9  | Brown   | Peter David  | 343         | Brown      | Pete D
... next batch of duplicates, etc.
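There is no single reliable query for fuzzy matching, but a practical starting point is a self-join on a few normalized columns, requiring the remaining columns to either agree or be missing. A rough sketch along those lines, using the example table's column names (SOUNDEX is only a crude stand-in for real fuzzy logic; SSIS Fuzzy Grouping or a dedicated tool does this better):

SELECT a.ID, a.Surname, a.Forenames,
       b.ID AS DuplicateID, b.Surname AS DupSurname, b.Forenames AS DupForenames
FROM dbo.Clients AS a
JOIN dbo.Clients AS b
  ON b.ID > a.ID                                              -- report each pair only once
 AND SOUNDEX(a.Surname) = SOUNDEX(b.Surname)                  -- tolerate spelling differences
 AND REPLACE(a.PostalCode, ' ', '') = REPLACE(b.PostalCode, ' ', '')
WHERE (a.DateOfBirth = b.DateOfBirth OR a.DateOfBirth IS NULL OR b.DateOfBirth IS NULL)
  AND (REPLACE(a.NINumber, ' ', '') = REPLACE(b.NINumber, ' ', '')
       OR a.NINumber IS NULL OR b.NINumber IS NULL)
ORDER BY a.ID, b.ID

Loosening or tightening the join conditions (e.g. adding DIFFERENCE() on forenames) trades false positives for false negatives, so the output is best treated as a candidate list for human review.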
I'm trying to delete duplicate records from the output of the query below; if they also meet certain conditions (i.e. a different address type), then I would merge the records. From the following query, how do I go about achieving one and/or the other, either from the output or as an extension of the query itself?
Hello all, I have an issue with duplicate Contact data. Here it is: I have a Contacts table:

CREATE TABLE CONTACTS (
    SSN int,
    fname varchar(40),
    lname varchar(40),
    address varchar(40),
    city varchar(40),
    state varchar(2),
    zip int
)

Here is some sample data:

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: NULL, CITY: NULL, STATE: NY, ZIP: 10033
SSN: 1112223333, FNAME: NULL, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: NULL

How do I merge the 2 rows (via SQL or T-SQL) to create one row as follows?

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: 10033

Pointers appreciated. Thanks
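For data shaped like this -- one SSN, and for each column at most one of the duplicate rows carrying a non-NULL value -- a simple GROUP BY with MAX collapses the rows, since MAX ignores NULLs. A minimal sketch under that assumption:

SELECT SSN,
       MAX(fname)   AS fname,
       MAX(lname)   AS lname,
       MAX(address) AS address,
       MAX(city)    AS city,
       MAX(state)   AS state,
       MAX(zip)     AS zip
FROM CONTACTS
GROUP BY SSN

If two rows for the same SSN ever disagree on a non-NULL column, MAX will silently pick one of the values, so it's worth checking for that case first.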
I have several sets of timedate ranges and I need to merge the ranges where there is no overlap with the jobs on resource1. In my example data, I want all jobs from ResourceID 1 and those jobs from all other resources where they do not overlap with EXISTING jobs on resource 1 (i.e. imagine I'm trying to select candidates from other resources to fill ResourceID 1 with continuous jobs)
Below are some sample data, my failed attempt, and the expected results. I managed to exclude everything that should be excluded except job 10.
-- Need to select all other jobs from all other resources that can be merged into resource 1 where there is no overlap with existing jobs in resource 1 only
CREATE TABLE #Jobs (
    resourceID INT,
    JobNo INT,
    StartTime SMALLDATETIME,
    EndTime SMALLDATETIME,
    ShouldBeOmitted BIT
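The usual overlap test is "A starts before B ends AND B starts before A ends", and the key detail for this requirement is that candidate jobs are compared only against the existing jobs on resource 1, never against each other (comparing candidates to each other is a common reason a job like job 10 gets dropped). A sketch against the #Jobs table above:

SELECT j.*
FROM #Jobs AS j
WHERE j.resourceID = 1
   OR NOT EXISTS (
        SELECT 1
        FROM #Jobs AS r1
        WHERE r1.resourceID = 1
          AND r1.StartTime < j.EndTime     -- use <= on both lines if touching ranges count as overlap
          AND j.StartTime  < r1.EndTime
      )
ORDER BY j.StartTime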
How do you merge data from one database into another when both databases have identity columns? We are merging two companies, so we need to take the employee table from the second database and insert it into the first database, along with the corresponding foreign-key tables (say, some 7 tables). How do you do the merge when the tables have identity columns?
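The usual pattern is to let the target database assign new identity values and capture an old-to-new mapping as you insert, then re-point the child tables through that map. A hedged sketch (the database, table, and column names below are placeholders, not the real schema); MERGE is used here only because its OUTPUT clause can see source columns, which a plain INSERT...SELECT cannot:

DECLARE @map TABLE (old_employee_id INT, new_employee_id INT);

MERGE INTO CompanyA.dbo.Employee AS tgt
USING CompanyB.dbo.Employee AS src
      ON 1 = 0                                 -- never matches: insert every source row
WHEN NOT MATCHED THEN
    INSERT (employee_name, hire_date)          -- do NOT insert the identity column
    VALUES (src.employee_name, src.hire_date)
OUTPUT src.employee_id, inserted.employee_id INTO @map;   -- capture old -> new

-- The ~7 child (foreign-key) tables then join through the map:
INSERT INTO CompanyA.dbo.EmployeeAddress (employee_id, address_line)
SELECT m.new_employee_id, a.address_line
FROM CompanyB.dbo.EmployeeAddress AS a
JOIN @map AS m ON m.old_employee_id = a.employee_id;

The alternative -- SET IDENTITY_INSERT ON and keeping the old key values -- only works if the two companies' key ranges never collide, which is rarely safe to assume.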
Hello there. I am completely new to SQL and this forum, and this problem that I have may appear very basic to you guys, but still... I was wondering if I could get some help with a database I am trying to make in MS Access.
I have used the Access TransferText function to import data from a text file into a table with an ID attached to each line, eg.
ID | Text
1  | Hello world
2  | This is an example
3  | Of my database
I want to merge the data, or copy it into a field in a new table to get:
ID | Text
1  | Hello World
     This is an example
     Of my database
2  | [more imported text from a different table]
and I have been advised that SQL is the best way to do this. Is it possible to have line breaks in a field within Microsoft Access, or would it have to be structured as:
ID | Text
1  | Hello World This is an Example Of My Database
2  | ...
I am trying to create a dimension table, and I am pulling in data from two tables to create it. I need all records from table A, any records from table B that are not in table A, and I need to use the fields from B for those records that do match. What would be the best way to approach this -- merge join + derived columns, or union all + aggregation? Any suggestions?
It seems like it's harder to do this in SSIS rather than just doing it in the database.
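If the database route is acceptable, a FULL OUTER JOIN expresses this directly: every key from A, every B-only key, and B's attributes whenever B has the row. A sketch with placeholder key/column names (the real table and column names aren't given):

SELECT COALESCE(b.business_key, a.business_key) AS business_key,
       CASE WHEN b.business_key IS NOT NULL THEN b.attr1 ELSE a.attr1 END AS attr1,
       CASE WHEN b.business_key IS NOT NULL THEN b.attr2 ELSE a.attr2 END AS attr2
FROM TableA AS a
FULL OUTER JOIN TableB AS b
     ON b.business_key = a.business_key

The SSIS equivalent of the same shape is a Merge Join configured as a full outer join on the sorted key, followed by a Derived Column that prefers the B-side columns whenever the B-side key is not null.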
I'm trying to avoid a large amount of manual data manipulation.
Here's the background: a legacy system that has (well, let's call apples apples) pretty much no method of enforcing data integrity, which has caused a fairly decent amount of garbage data to be inserted into some tables. I'm pulling the [Individuals] table from within this legacy system and inserting it into a production system, into the table schema currently in place to track [Individuals] in that production system.
Problem: Inserting the information is easy; the hard part is deduplicating the records in the staging table that the legacy [Individuals] table has been dumped into, prior to insertion into production. (I want to do this programmatically, with SQL or SSIS preferably, so that I can alter it later to allow for updating existing records and inserting new ones.)
Staging Table Schema:
CREATE TABLE [dbo].[stage_Individuals](
    [SysID] [int] NULL,            -- unique, though it's not an index intended to identify the [Individuals]
    [JJISID] [nvarchar](10) NULL,
    [NameLast] [nvarchar](30) NULL,
    [NameFirst] [nvarchar](30) NULL,
    [NameMiddle] [nvarchar](30) NULL,
[code]....
Scenario: There are records that duplicate the JJISID, though this value is supposed to be unique for every individual. The SysID is just a clustered index (I'm assuming) within the legacy system and will most likely be dropped when inserted into the production [Individuals] table. There are also records that are missing their JJISID -- which isn't supposed to happen either -- but that have valid information in SSN/DOB/Name/etc. that can be merged into the correct record that has a JJISID assigned. There is really no data conformity: some records have NULLs for everything except the JJISID, and some records have all the [Individuals] information except the JJISID.
Currently I am running the following SQL just to get a list of the records that have a duplicate JJISID (I have others that partition by Name/DOB/etc., and will adapt whatever I come up with to be used for those as well):
select j.*
from (
    select ROW_NUMBER() OVER (PARTITION BY JJISID ORDER BY JJISID) as RowNum,
           stage_Individuals.*,
           COUNT(*) OVER (PARTITION BY JJISID) as cnt
    from stage_Individuals
) as j
where cnt > 1 and j.JJISID is not null

Now, with SQL Server 2012 or later I could use LAG and LEAD with the RowNum value to do my data manipulation... but that won't work, because we are on SQL Server 2008 in this environment.
[URL]
With, the following as a potential solution:
GSquared (3/16/2010): Here's a query that seems to do what you need. Try it, let me know if it works.
Performance on it will be a problem, but I can't fine-tune that. You'll need to look at various methods for getting this kind of data from the table and work out which variation will be best for your data. Without access to the actual table, I can't do that.
WITH CTE AS (
    SELECT master_id, MIN(ID) AS first_id, MAX(Account_Expiry) AS latest_expiry
    FROM #People
    GROUP BY master_id
)
SELECT P1.master_id,
[code].....
Unfortunately, I don't think that will accomplish what I'm looking for - I have some records that are duplicated 6 times, and I'm wanting to keep the values within these that aren't NULL.
Basically what I'm looking for, is to update any column with a NULL value to the corresponding Duplicate [Individuals] record value for that column.
**EDIT - Example: Record 1 has a JJISID with NULL NameFirst & NameLast, BUT Record 2 has the same JJISID and values for NameFirst & NameLast. I want to propagate the NameFirst & NameLast from Record 2 into Record 1.
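One SQL 2008-compatible, set-based way to do that is with window aggregates (which, unlike LAG/LEAD, are available in 2008): MAX() OVER (PARTITION BY JJISID) picks the non-NULL value within each duplicate group, and updating through the CTE fills only the missing columns. A sketch for two of the columns; the same pattern extends to the rest:

;WITH filled AS (
    SELECT NameFirst, NameLast,
           MAX(NameFirst) OVER (PARTITION BY JJISID) AS BestNameFirst,
           MAX(NameLast)  OVER (PARTITION BY JJISID) AS BestNameLast
    FROM dbo.stage_Individuals
    WHERE JJISID IS NOT NULL
)
UPDATE filled
SET NameFirst = COALESCE(NameFirst, BestNameFirst),   -- only fills the NULLs
    NameLast  = COALESCE(NameLast,  BestNameLast)
WHERE NameFirst IS NULL
   OR NameLast  IS NULL;

Once the NULLs are filled, the duplicate-listing query above (cnt > 1, RowNum > 1) can drive the DELETE of the now-redundant rows.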
I have a database full of different types of leads -- some for company A, some for company B, and so on -- each doing a different service. However, the leads from B can be used for A and the leads from A can be used for B, so I want to merge the data.
Example:
Phone Number | Name        | Home Owner | Credit | Insurance
727-555-1234 | Dave Thomas | Yes        | B      |
727-555-1234 | Dave Thomas |            |        | Gieco
I would like the end result to be one record:
Phone Number | Name        | Home Owner | Credit | Insurance
727-555-1234 | Dave Thomas | Yes        | B      | Gieco
Since these were imported into SQL Server, they all have a unique ID. Here are the current labels:
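Assuming the phone number plus name identifies a person and that, for each remaining column, only one of the duplicate rows carries a value, a GROUP BY with MAX collapses each pair into one record (MAX ignores NULLs, and a real value sorts above an empty string). The table and column names below are guesses based on the grid above:

SELECT [Phone Number],
       [Name],
       MAX([Home Owner]) AS [Home Owner],
       MAX([Credit])     AS [Credit],
       MAX([Insurance])  AS [Insurance]
FROM dbo.Leads
GROUP BY [Phone Number], [Name]

If the surviving row needs to keep one of the existing unique IDs, add MIN(ID) (or MAX(ID)) to the select list and use that as the winner.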
I have a table with about half a million records, each representing a patient in my county.
Each record has a field (RRank) which basically sorts the patients as to how "unwell" they are according to a previously-applied algorithm. The most unwell patient has an RRank of 1, the next-most unwell has RRank=2 etc.
I have just deleted several hundred records (which relate to patients now deceased) from the table, thereby leaving gaps in the RRank sequence. I want to renumber the remaining recs to get rid of the gaps.
I can see what I want to accomplish by using ROW_NUMBER, thus:
SELECT ROW_NUMBER() OVER (ORDER BY RRank) as RecNumber, RRank
FROM RPL
ORDER BY RRank
I see the numbers in the RecNumber column falling behind the RRank as I scan down the results
My question is: How to convert this into an UPDATE statement? I had hoped that I could do something like:
UPDATE RISC_PatientList_TEMP SET RRank = ROW_NUMBER() Over (ORDER BY RRank);
but the system informs me that windowed functions can only appear in the SELECT or ORDER BY clauses -- and an UPDATE's SET clause is neither, nor can I legally add an ORDER BY.
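The standard workaround is to compute ROW_NUMBER() in a CTE (or derived table) and update through it; against the RPL table from the SELECT above, that looks roughly like this:

;WITH Renumbered AS (
    SELECT RRank,
           ROW_NUMBER() OVER (ORDER BY RRank) AS RecNumber
    FROM RPL
)
UPDATE Renumbered
SET RRank = RecNumber;   -- closes the gaps left by the deleted patients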
I found some duplicate data as I was going through the logic of a data pump. The entire row is not duplicated, however. I would like to delete only the one row.
This is a sample of the data:

DECLARE @SomeData TABLE (
    FirstName varchar(25),
    MiddleName varchar(25),
    LastName varchar(25),
    StreetAddress varchar(25),
    Suite varchar(25),
    City varchar(25),
    [State] varchar(25),
    PostalCode varchar(10)
[code]...
As you can see, Joe Smith has two rows, but only one of the rows is complete. I would like to delete only the row that has a NULL value in the phone and area code for Joe Smith. There are a few thousand rows like this; they are duplicates in everything but the area code and phone number. I am used to using a CTE to remove duplicates, but I am a little lost on this one. The things that I have tried have not worked exactly as I planned.
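A CTE still works here; the trick is to partition by the columns that do match and order so that the complete row wins. A sketch against the table variable above -- note that the AreaCode and Phone column names are assumptions, since they aren't in the (truncated) declaration:

;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, MiddleName, LastName,
                            StreetAddress, Suite, City, [State], PostalCode
               ORDER BY CASE WHEN AreaCode IS NULL OR Phone IS NULL THEN 1 ELSE 0 END  -- complete row first
           ) AS rn
    FROM @SomeData
)
DELETE FROM ranked
WHERE rn > 1;   -- removes the incomplete copy in each group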
My new employer is CMM Level 3. As part of the CMM/Personal Software Process, I am required to create pseudo code for my stored procedure and UDF design. Has anyone done this? If so, can anyone give me some advice?
I am trying to create a trigger on a table but when I check the syntax it tells me that "The column prefix 'inserted' does not match with a table name or alias used in this query"
CREATE TRIGGER trg_Structural_GenerateBarcode
ON [dbo].[tbStructuralComponentSchedule]
AFTER INSERT
AS
DECLARE @iCount int, @cBarcode char(25), @cCode char(4)
DECLARE @cProject char(7), @cComponent char(10), @iEntryID int
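That error almost always means inserted.<column> was referenced in a statement whose FROM clause doesn't actually include the inserted pseudo-table (or its alias). A minimal sketch of the shape that works -- the column names and the barcode rule here are placeholders, since the rest of the trigger isn't shown:

CREATE TRIGGER trg_Structural_GenerateBarcode
ON [dbo].[tbStructuralComponentSchedule]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE s
    SET s.Barcode = i.ProjectCode + '-' + i.ComponentCode      -- placeholder barcode rule
    FROM dbo.tbStructuralComponentSchedule AS s
    INNER JOIN inserted AS i            -- inserted must appear in the FROM clause before it can be prefixed
            ON i.EntryID = s.EntryID;   -- join on the table's key (EntryID is a placeholder)
END

Written set-based like this, the trigger also handles multi-row inserts, which the one-variable-per-column approach (@cProject, @iEntryID, etc.) does not.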
I'm in the process of trying to identify duplicate contacts. I'm doing this for millions of contacts, have gotten stuck, and could use some elegant solutions!
The business rule is this:
Any contacts that have the same name, phone, and email address are the same contact.
Any contacts that have the same name and email address are the same contact.
Any contacts that have the same name and email address but a different phone are different contacts.
Any contact that has the same name and email address and a blank phone can be the same contact as one that has the same name and email address and does have a phone.
Rank by the DataSource_fk, 1 being the highest.
Put another way:
If 3 contacts have the same name, 2 have phone '1112223344' and all three have the email address 'johndoe@gmail.com' they are the same contact and the lowest DataSource_fk should be ranked the highest.
I've used the Row_number over (Partition by) in the past, but am unsure how to deal with the blanks in email and phone.
DROP TABLE [dbo].[TestBusinessContact];
GO
CREATE TABLE [dbo].[TestBusinessContact] (
    [TestBusinessContact_pk] INT IDENTITY(1,1) NOT NULL,
    [Business_fk] INT NOT NULL CONSTRAINT DF_TestBusinessContact_Business_fk DEFAULT(0),
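One way to make Row_number over (Partition by) cope with the blanks is a two-step: first let a blank/NULL phone "borrow" a phone seen for the same name + email (so the row lands in that contact's group), then partition by name + email + that effective phone and rank by DataSource_fk. A sketch -- the name/email/phone column names are assumptions, since the table definition above is truncated:

;WITH filled AS (
    SELECT *,
           COALESCE(NULLIF(LTRIM(RTRIM(Phone)), ''),                     -- keep a real phone as-is
                    MAX(NULLIF(LTRIM(RTRIM(Phone)), ''))
                        OVER (PARTITION BY FirstName, LastName, Email)) AS EffectivePhone
    FROM dbo.TestBusinessContact
), ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, Email, EffectivePhone
               ORDER BY DataSource_fk) AS rn                             -- lowest DataSource_fk wins
    FROM filled
)
SELECT *
FROM ranked
WHERE rn = 1;   -- the surviving row per contact group

One edge case to decide on: if a name + email combination has several different non-blank phones plus a blank row, the blank row gets assigned to the MAX phone arbitrarily, which may or may not be the desired business rule.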
Yes, I know this subject has been exhausted, but I need help in locating the discussion which took place a few months ago. Sharon relayed to the group a piece of software (expensive) which would help in my particular situation. I grabbed a demo and have gotten the approval for purchase. Unfortunately, I don't have the thread with me at work.
The problem:
Number | Fname | Lname    | Age | ID
123    | John  | Franklin | 43  | 1
123    | Jane  | Franklin | 40  | 2
123    | Jeff  | Franklin | 12  | 3
124    | Jean  | Simmons  | 39  | 4
125    | Gary  | Bender   | 37  | 5
126    | Fred  | Johnson  | 29  | 6
126    | Fred  | Johnson  | 39  | 7
127    | Gene  | Simmons  | 47  | 8
The idea would be to get only unique values in the Number column. I don't care which information I grab from the other columns, but I must have those fields included. If my result set looked as follows, that would be fine -- or any other way, as long as all of the fields had information and there were only unique values in the Number field.
Number | Fname | Lname    | Age | ID
123    | Jeff  | Franklin | 12  | 3
124    | Jean  | Simmons  | 39  | 4
125    | Gary  | Bender   | 37  | 5
126    | Fred  | Johnson  | 39  | 7
127    | Gene  | Simmons  | 47  | 8
If anyone remembers this discussion, mainly the date, I would really appreciate it.
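Setting the old thread aside, the set-based way to get one arbitrary-but-complete row per Number is to number the rows within each Number group and keep the first. A sketch (the table name is a placeholder; change the ORDER BY if a particular row should win):

SELECT Number, Fname, Lname, Age, ID
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Number ORDER BY ID DESC) AS rn
    FROM dbo.People        -- placeholder table name
) AS t
WHERE rn = 1
ORDER BY Number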
I have two tables: one contains all work orders, and the second contains records for work orders that are linked to customer orders. I'm trying to create a query that returns specific fields for the orders in the linked-order table (demand_supply_link), plus only those work orders in the all-orders table (work_order) that do not exist in the linked-order table. I have tried several queries and cannot get the results I desire. Here is the query I am currently trying.
SELECT DISTINCT
    WORK_ORDER.DESIRED_WANT_DATE as 'Want Date',
    DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID as 'WO Id',
    WORK_ORDER.DESIRED_QTY as 'End Qty',
    DEMAND_SUPPLY_LINK.SUPPLY_PART_ID as 'Part Id',
    CUST_ORDER_LINE.CUSTOMER_PART_ID as 'Cust Part',
    OPERATION.RESOURCE_ID as Resource,
    PART.DESCRIPTION as Description,
    CUSTOMER.NAME as Name
FROM ((((DEMAND_SUPPLY_LINK
    INNER JOIN CUST_ORDER_LINE ON DEMAND_SUPPLY_LINK.DEMAND_BASE_ID = CUST_ORDER_LINE.CUST_ORDER_ID)
    INNER JOIN WORK_ORDER ON DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID = WORK_ORDER.BASE_ID)
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID)
    INNER JOIN PART ON WORK_ORDER.PART_ID = PART.ID)
    INNER JOIN (CUSTOMER INNER JOIN CUSTOMER_ORDER ON CUSTOMER.ID = CUSTOMER_ORDER.CUSTOMER_ID)
        ON CUST_ORDER_LINE.CUST_ORDER_ID = CUSTOMER_ORDER.ID
WHERE WORK_ORDER.DESIRED_WANT_DATE Is Not Null
    AND OPERATION.RESOURCE_ID in ('ASSY','FAB 1','PLAY TRK')
    AND WORK_ORDER.STATUS='R'
UNION
SELECT distinct
    work_order.desired_want_date as 'Want Date',
    work_order.BASE_id as 'WO Id',
    work_order.desired_qty as 'End Qty',
    work_order.part_id as 'Part Id',
    operation.resource_id as Resource,
    part.description as Description
FROM WORK_ORDER
    INNER JOIN PART ON PART_ID = WORK_ORDER.PART_ID
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID
WHERE WORK_ORDER.DESIRED_WANT_DATE IS NOT NULL
    AND OPERATION.RESOURCE_ID IN ('ASSY','FAB 1', 'PLAY TRK')
    AND WORK_ORDER.STATUS='R'
This is the error I receive: Server: Msg 205, Level 16, State 1, Line 1 All queries in an SQL statement containing a UNION operator must have an equal number of expressions in their target lists.
The all orders table (work_order) will not have the other fields to link to as there is no customer order linked to them.
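The first SELECT returns 8 columns and the second only 6, and a UNION needs the same count (and compatible types, in the same positions) on both sides. One common fix is to pad the second SELECT with NULL placeholders for the customer-order columns it can't supply, and a NOT EXISTS can be added so that half really does return only the unlinked work orders. A sketch of the second half under those assumptions (note the PART join is written against PART.ID, matching the first SELECT):

UNION
SELECT DISTINCT
    WORK_ORDER.DESIRED_WANT_DATE AS 'Want Date',
    WORK_ORDER.BASE_ID           AS 'WO Id',
    WORK_ORDER.DESIRED_QTY       AS 'End Qty',
    WORK_ORDER.PART_ID           AS 'Part Id',
    NULL                         AS 'Cust Part',    -- placeholder: no customer order line for these
    OPERATION.RESOURCE_ID        AS Resource,
    PART.DESCRIPTION             AS Description,
    NULL                         AS Name            -- placeholder: no customer for these
FROM WORK_ORDER
    INNER JOIN PART ON PART.ID = WORK_ORDER.PART_ID
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID
WHERE WORK_ORDER.DESIRED_WANT_DATE IS NOT NULL
    AND OPERATION.RESOURCE_ID IN ('ASSY', 'FAB 1', 'PLAY TRK')
    AND WORK_ORDER.STATUS = 'R'
    AND NOT EXISTS (SELECT 1
                    FROM DEMAND_SUPPLY_LINK
                    WHERE DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID = WORK_ORDER.BASE_ID)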
Can someone tell me the best procedure when trying to find duplicate records within a table(s)?
I'm new to using SQL Server, and I have been informed that there may be some dups within unknown tables. I need to find these dups.
If someone can tell me how to perform this procedure, I would appreciate it. If you reply, could you also include examples that I could follow, for my records?
Table1 has shop# and shop_id. Every shop# should have only one shop_id. There have been a few data entry errors where a shop# has duplicate shop_ids. How do I write a query for shop#s that have more than one shop_id?
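A plain GROUP BY/HAVING on Table1 finds the offending shop#s, and joining back to the table shows the conflicting rows:

SELECT [shop#]
FROM Table1
GROUP BY [shop#]
HAVING COUNT(DISTINCT shop_id) > 1;

SELECT t.*
FROM Table1 AS t
JOIN (SELECT [shop#]
      FROM Table1
      GROUP BY [shop#]
      HAVING COUNT(DISTINCT shop_id) > 1) AS d
  ON d.[shop#] = t.[shop#]
ORDER BY t.[shop#], t.shop_id;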
Not so sure how simple this question is but here is what happened. I installed SQL Server 2005 on a new Win Server 2003. I exported the tables and their data from the old machine to the newly established database on the new machine.
It looks like all my records were duplicated. When I try to delete one of the duplicates it won't work, because both rows are affected. I can't set my primary key now, and if I try to create a new database with the primary key already set, the import fails.
Any one run into this before or know what's going on?
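Since the rows are exact copies and there is no key to tell them apart, one common fix is to number the rows within groups of identical values and delete all but the first; after that, the primary key can be created and the import rerun cleanly next time. A sketch with placeholder table/column names (list every column of the real table in the PARTITION BY):

;WITH dupes AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Col1, Col2, Col3      -- list ALL of the table's columns here
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM dbo.MyImportedTable                      -- placeholder table name
)
DELETE FROM dupes
WHERE rn > 1;   -- keeps exactly one copy of each row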
Hi, I have written a web application using Dreamweaver MX, ASP.NET, and MS SQL Server 2005.

The problem I am having occurs when I attempt to edit a record. I have set up a DataGrid with free-form fields so that the user can click Edit, make the required changes within the data grid, then click Update; the data is then saved to the database. All this was created using Dreamweaver, and most of the code was automatically generated for me.

The problem is that sometimes (not every time) when I go to edit a record, once I hit the Update button to save the changes, the record is duplicated 1 or more times. This doesn't happen every time, but when it does it duplicates the record between 1 and about 5 times. I have double-checked everything but cannot find anything obvious that may be causing this issue. Does anyone have any suggestions as to what I should look for? Is this a coding error or something wrong with MS SQL? Any ideas?

Thanks in advance,
Mitch
Hi all, how do I avoid duplicate records in my database? I have 4 textboxes that collect user information, and this information is saved in the database. When a user fills in the textboxes and clicks the submit button, I want to check the database to see whether the exact same record already exists before the data is saved. If the user is already registered in the database, he won't be allowed to log in. How can I achieve this? I thought of using the CompareValidator, but I'm not sure how to proceed. Thanks.
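On the SQL side, the usual pattern is a conditional insert guarded by the same four values (backed, ideally, by a UNIQUE constraint so that two simultaneous submits can't both slip through). A minimal sketch -- the table and column names here are placeholders for whatever the four textboxes map to:

IF NOT EXISTS (
    SELECT 1
    FROM dbo.Users
    WHERE UserName = @UserName
      AND Email    = @Email
      AND Phone    = @Phone
      AND PostCode = @PostCode
)
BEGIN
    INSERT INTO dbo.Users (UserName, Email, Phone, PostCode)
    VALUES (@UserName, @Email, @Phone, @PostCode);
END

The application can then report "already registered" whenever the IF NOT EXISTS branch is skipped (e.g. by checking @@ROWCOUNT after the statement).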