We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl
drop column last_login_date
select *
from mytbl
order by employee_id, effective_date
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system and take the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
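One set-based approach that still preserves Cheryl's back-and-forth history is the classic gaps-and-islands trick: number the rows per employee, number them again per employee plus the surviving columns, and the difference between the two identifies each run of consecutive identical records, which can then be collapsed with MIN/MAX. A rough sketch, assuming the column names from the example table and that consecutive records for an employee are contiguous in time:

;WITH grp AS (
    SELECT employee_id, employee_name, employee_dept,
           effective_date, expiration_date,
           ROW_NUMBER() OVER (PARTITION BY employee_id
                              ORDER BY effective_date)
         - ROW_NUMBER() OVER (PARTITION BY employee_id, employee_name, employee_dept
                              ORDER BY effective_date) AS island   -- constant within each run of identical values
    FROM mytbl
)
SELECT employee_id, employee_name, employee_dept,
       MIN(effective_date)  AS effective_date,
       MAX(expiration_date) AS expiration_date
FROM grp
GROUP BY employee_id, employee_name, employee_dept, island
ORDER BY employee_id, MIN(effective_date)

That result can either be materialized into a new table and swapped in, or turned into a DELETE of the redundant rows plus an UPDATE of the surviving rows' expiration_date, whichever is cheaper for tables this size.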
Are there any useful SQL queries that might be used to identify lists of potential duplicate records in a table?
For example, I have a Client database that includes a table dbo.Clients. This table contains various columns which could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode, etc. The data contained in these columns is not always exactly the same due to differences caused by user data entry, so some records may have missing data in some of the columns, and there could be spelling differences too. Like the following examples:
1 | Smith | John Raymond | NULL       | NI990946B     | SW12 8TQ
2 | Smith | John         | 06/03/1967 | NULL          | SW12 8TQ
3 | Smith | Jon Raymond  | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that whilst it is easy for a human being to review these 3 entries and conclude that they are most likely the same client entered into the database 3 times, I cannot find a reliable way of identifying them using a SQL query.
I've considered using some sort of concatenation to a new column, minus white space and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy Logic maybe?
The results would produce a grid something like this for the example above:
ID | Surname | Forenames    | DuplicateID | DupSurname | DupForenames
1  | Smith   | John Raymond | 2           | Smith      | John
1  | Smith   | John Raymond | 3           | Smith      | Jon Raymond
9  | Brown   | Peter David  | 343         | Brown      | Pete D
... next batch of duplicates, etc.
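There is no single reliable query for fuzzy matching, but a practical starting point is a self-join on a few normalized columns, requiring the remaining columns to either agree or be missing. A rough sketch along those lines, using the example table's column names (SOUNDEX is only a crude stand-in for real fuzzy logic; SSIS Fuzzy Grouping or a dedicated tool does this better):

SELECT a.ID, a.Surname, a.Forenames,
       b.ID AS DuplicateID, b.Surname AS DupSurname, b.Forenames AS DupForenames
FROM dbo.Clients AS a
JOIN dbo.Clients AS b
  ON b.ID > a.ID                                              -- report each pair only once
 AND SOUNDEX(a.Surname) = SOUNDEX(b.Surname)                  -- tolerate spelling differences
 AND REPLACE(a.PostalCode, ' ', '') = REPLACE(b.PostalCode, ' ', '')
WHERE (a.DateOfBirth = b.DateOfBirth OR a.DateOfBirth IS NULL OR b.DateOfBirth IS NULL)
  AND (REPLACE(a.NINumber, ' ', '') = REPLACE(b.NINumber, ' ', '')
       OR a.NINumber IS NULL OR b.NINumber IS NULL)
ORDER BY a.ID, b.ID

Loosening or tightening the join conditions (e.g. adding DIFFERENCE() on forenames) trades false positives for false negatives, so the output is best treated as a candidate list for human review.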
I'm trying to delete duplicate records from the output of the query below; if they also meet certain conditions (i.e. a different address type), then I would merge the records. From the following query, how do I go about achieving one and/or the other, either from the output or as an extension of the query itself?
Hello all, I have an issue with duplicate Contact data. Here it is: I have a Contacts table:

CREATE TABLE CONTACTS (
    SSN int,
    fname varchar(40),
    lname varchar(40),
    address varchar(40),
    city varchar(40),
    state varchar(2),
    zip int
)

Here is some sample data:

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: NULL, CITY: NULL, STATE: NY, ZIP: 10033
SSN: 1112223333, FNAME: NULL, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: NULL

How do I merge the 2 rows (via SQL or T-SQL) to create one row as follows?

SSN: 1112223333, FNAME: FRANK, LNAME: WHALEY, ADDRESS: 100 MADISON AVE, CITY: NEW YORK, STATE: NY, ZIP: 10033

Pointers appreciated. Thanks
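For data shaped like this -- one SSN, and for each column at most one of the duplicate rows carrying a non-NULL value -- a simple GROUP BY with MAX collapses the rows, since MAX ignores NULLs. A minimal sketch under that assumption:

SELECT SSN,
       MAX(fname)   AS fname,
       MAX(lname)   AS lname,
       MAX(address) AS address,
       MAX(city)    AS city,
       MAX(state)   AS state,
       MAX(zip)     AS zip
FROM CONTACTS
GROUP BY SSN

If two rows for the same SSN ever disagree on a non-NULL column, MAX will silently pick one of the values, so it's worth checking for that case first.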
I have several sets of timedate ranges and I need to merge the ranges where there is no overlap with the jobs on resource1. In my example data, I want all jobs from ResourceID 1 and those jobs from all other resources where they do not overlap with EXISTING jobs on resource 1 (i.e. imagine I'm trying to select candidates from other resources to fill ResourceID 1 with continuous jobs)
Below are some sample data, my failed attempt, and the expected results. I managed to exclude everything that should be excluded except job 10.
-- Need to select all other jobs from all other resources that can be merged into resource 1 where there is no overlap with existing jobs in resource 1 only
CREATE TABLE #Jobs (
    resourceID INT,
    JobNo INT,
    StartTime SMALLDATETIME,
    EndTime SMALLDATETIME,
    ShouldBeOmitted BIT
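The usual overlap test is "A starts before B ends AND B starts before A ends", and the key detail for this requirement is that candidate jobs are compared only against the existing jobs on resource 1, never against each other (comparing candidates to each other is a common reason a job like job 10 gets dropped). A sketch against the #Jobs table above:

SELECT j.*
FROM #Jobs AS j
WHERE j.resourceID = 1
   OR NOT EXISTS (
        SELECT 1
        FROM #Jobs AS r1
        WHERE r1.resourceID = 1
          AND r1.StartTime < j.EndTime     -- use <= on both lines if touching ranges count as overlap
          AND j.StartTime  < r1.EndTime
      )
ORDER BY j.StartTime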
How do you merge data from one database into another when both databases have identity columns? We are merging two companies, so we need to take the employee table from the second database and insert it into the first database, along with the corresponding foreign-key tables (say, some 7 tables). How do you do the merge when the tables have identity columns?
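The usual pattern is to let the target database assign new identity values and capture an old-to-new mapping as you insert, then re-point the child tables through that map. A hedged sketch (the database, table, and column names below are placeholders, not the real schema); MERGE is used here only because its OUTPUT clause can see source columns, which a plain INSERT...SELECT cannot:

DECLARE @map TABLE (old_employee_id INT, new_employee_id INT);

MERGE INTO CompanyA.dbo.Employee AS tgt
USING CompanyB.dbo.Employee AS src
      ON 1 = 0                                 -- never matches: insert every source row
WHEN NOT MATCHED THEN
    INSERT (employee_name, hire_date)          -- do NOT insert the identity column
    VALUES (src.employee_name, src.hire_date)
OUTPUT src.employee_id, inserted.employee_id INTO @map;   -- capture old -> new

-- The ~7 child (foreign-key) tables then join through the map:
INSERT INTO CompanyA.dbo.EmployeeAddress (employee_id, address_line)
SELECT m.new_employee_id, a.address_line
FROM CompanyB.dbo.EmployeeAddress AS a
JOIN @map AS m ON m.old_employee_id = a.employee_id;

The alternative -- SET IDENTITY_INSERT ON and keeping the old key values -- only works if the two companies' key ranges never collide, which is rarely safe to assume.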
Hello there. I am completely new to SQL and this forum, and this problem that I have may appear very basic to you guys, but still... I was wondering if I could get some help with a database I am trying to make in MS Access.
I have used the Access TransferText function to import data from a text file into a table with an ID attached to each line, eg.
ID | Text
1  | Hello world
2  | This is an example
3  | Of my database
I want to merge the data, or copy it into a field in a new table to get:
ID | Text
1  | Hello World
     This is an example
     Of my database
2  | [more imported text from a different table]
and I have been advised that SQL is the best way to do this. Is it possible to have line breaks in a field within Microsoft Access, or would it have to be structured as:
ID | Text
1  | Hello World This is an Example Of My Database
2  | ...
I am trying to create a dimension table, and I am pulling in data from two tables to create it. I need all records from table A, any records from table B that are not in table A, and I need to use the fields from B for those records that do match. What would be the best way to approach this -- merge join + derived columns, or union all + aggregation? Any suggestions?
It seems like it's harder to do this in SSIS rather than just doing it in the database.
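If the database route is acceptable, a FULL OUTER JOIN expresses this directly: every key from A, every B-only key, and B's attributes whenever B has the row. A sketch with placeholder key/column names (the real table and column names aren't given):

SELECT COALESCE(b.business_key, a.business_key) AS business_key,
       CASE WHEN b.business_key IS NOT NULL THEN b.attr1 ELSE a.attr1 END AS attr1,
       CASE WHEN b.business_key IS NOT NULL THEN b.attr2 ELSE a.attr2 END AS attr2
FROM TableA AS a
FULL OUTER JOIN TableB AS b
     ON b.business_key = a.business_key

The SSIS equivalent of the same shape is a Merge Join configured as a full outer join on the sorted key, followed by a Derived Column that prefers the B-side columns whenever the B-side key is not null.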
I'm trying to avoid a large amount of manual data manipulation.
Here's the background: a legacy system that has (well, let's call apples apples) pretty much no method of enforcing data integrity, which has caused a fairly decent amount of garbage data to be inserted into some tables. I'm pulling the [Individuals] table from within this legacy system and inserting it into a production system, into the table schema currently in place to track [Individuals] in that production system.
Problem: Inserting the information is easy; the hard part is deduplicating the records in the staging table that the legacy [Individuals] table has been dumped into, prior to insertion into production. (I want to do this programmatically, with SQL or SSIS preferably, so that I can alter it later to allow for updating existing records and inserting new ones.)
Staging Table Schema:
CREATE TABLE [dbo].[stage_Individuals](
    [SysID] [int] NULL,            -- unique, though it's not an index intended to identify the [Individuals]
    [JJISID] [nvarchar](10) NULL,
    [NameLast] [nvarchar](30) NULL,
    [NameFirst] [nvarchar](30) NULL,
    [NameMiddle] [nvarchar](30) NULL,
[code]....
Scenario: There are records that duplicate the JJISID, though this value is supposed to be unique for every individual. The SysID is just a clustered index (I'm assuming) within the legacy system and will most likely be dropped when inserted into the production [Individuals] table. There are also records that are missing their JJISID -- which isn't supposed to happen either -- but that have valid information in SSN/DOB/Name/etc. that can be merged into the correct record that has a JJISID assigned. There is really no data conformity: some records have NULLs for everything except the JJISID, and some records have all the [Individuals] information except the JJISID.
Currently I am running the following SQL just to get a list of the records that have a duplicate JJISID (I have others that partition by Name/DOB/etc., and will adapt whatever I come up with to be used for those as well):
select j.*
from (
    select ROW_NUMBER() OVER (PARTITION BY JJISID ORDER BY JJISID) as RowNum,
           stage_Individuals.*,
           COUNT(*) OVER (PARTITION BY JJISID) as cnt
    from stage_Individuals
) as j
where cnt > 1 and j.JJISID is not null

Now, with SQL Server 2012 or later I could use LAG and LEAD with the RowNum value to do my data manipulation... but that won't work, because we are on SQL Server 2008 in this environment.
[URL]
With, the following as a potential solution:
GSquared (3/16/2010): Here's a query that seems to do what you need. Try it, let me know if it works.
Performance on it will be a problem, but I can't fine-tune that. You'll need to look at various methods for getting this kind of data from the table and work out which variation will be best for your data. Without access to the actual table, I can't do that.
WITH CTE AS (
    SELECT master_id, MIN(ID) AS first_id, MAX(Account_Expiry) AS latest_expiry
    FROM #People
    GROUP BY master_id
)
SELECT P1.master_id,
[code].....
Unfortunately, I don't think that will accomplish what I'm looking for - I have some records that are duplicated 6 times, and I'm wanting to keep the values within these that aren't NULL.
Basically what I'm looking for, is to update any column with a NULL value to the corresponding Duplicate [Individuals] record value for that column.
**EDIT - Example: Record 1 has a JJISID with NULL NameFirst & NameLast, BUT Record 2 has the same JJISID and values for NameFirst & NameLast. I want to propagate the NameFirst & NameLast from Record 2 into Record 1.
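One SQL 2008-compatible, set-based way to do that is with window aggregates (which, unlike LAG/LEAD, are available in 2008): MAX() OVER (PARTITION BY JJISID) picks the non-NULL value within each duplicate group, and updating through the CTE fills only the missing columns. A sketch for two of the columns; the same pattern extends to the rest:

;WITH filled AS (
    SELECT NameFirst, NameLast,
           MAX(NameFirst) OVER (PARTITION BY JJISID) AS BestNameFirst,
           MAX(NameLast)  OVER (PARTITION BY JJISID) AS BestNameLast
    FROM dbo.stage_Individuals
    WHERE JJISID IS NOT NULL
)
UPDATE filled
SET NameFirst = COALESCE(NameFirst, BestNameFirst),   -- only fills the NULLs
    NameLast  = COALESCE(NameLast,  BestNameLast)
WHERE NameFirst IS NULL
   OR NameLast  IS NULL;

Once the NULLs are filled, the duplicate-listing query above (cnt > 1, RowNum > 1) can drive the DELETE of the now-redundant rows.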
I have a database full of different types of leads -- some for company A, some for company B, and so on -- each doing a different service. However, the leads from B can be used for A and the leads from A can be used for B, so I want to merge the data.
Example:
Phone Number | Name        | Home Owner | Credit | Insurance
727-555-1234 | Dave Thomas | Yes        | B      |
727-555-1234 | Dave Thomas |            |        | Gieco
I would like the end result to be one record:
Phone Number | Name        | Home Owner | Credit | Insurance
727-555-1234 | Dave Thomas | Yes        | B      | Gieco
Since these were imported into SQL Server, they all have a unique ID. Here are the current labels:
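Assuming the phone number plus name identifies a person and that, for each remaining column, only one of the duplicate rows carries a value, a GROUP BY with MAX collapses each pair into one record (MAX ignores NULLs, and a real value sorts above an empty string). The table and column names below are guesses based on the grid above:

SELECT [Phone Number],
       [Name],
       MAX([Home Owner]) AS [Home Owner],
       MAX([Credit])     AS [Credit],
       MAX([Insurance])  AS [Insurance]
FROM dbo.Leads
GROUP BY [Phone Number], [Name]

If the surviving row needs to keep one of the existing unique IDs, add MIN(ID) (or MAX(ID)) to the select list and use that as the winner.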
I have a table with about half a million records, each representing a patient in my county.
Each record has a field (RRank) which basically sorts the patients as to how "unwell" they are according to a previously-applied algorithm. The most unwell patient has an RRank of 1, the next-most unwell has RRank=2 etc.
I have just deleted several hundred records (which relate to patients now deceased) from the table, thereby leaving gaps in the RRank sequence. I want to renumber the remaining recs to get rid of the gaps.
I can see what I want to accomplish by using ROW_NUMBER, thus:
SELECT ROW_NUMBER() OVER (ORDER BY RRank) as RecNumber, RRank
FROM RPL
ORDER BY RRank
I see the numbers in the RecNumber column falling behind the RRank as I scan down the results
My question is: How to convert this into an UPDATE statement? I had hoped that I could do something like:
UPDATE RISC_PatientList_TEMP SET RRank = ROW_NUMBER() Over (ORDER BY RRank);
but the system informs me that windowed functions can only appear in the SELECT or ORDER BY clauses -- and an UPDATE's SET clause is neither, nor can I legally add an ORDER BY.
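The standard workaround is to compute ROW_NUMBER() in a CTE (or derived table) and update through it; against the RPL table from the SELECT above, that looks roughly like this:

;WITH Renumbered AS (
    SELECT RRank,
           ROW_NUMBER() OVER (ORDER BY RRank) AS RecNumber
    FROM RPL
)
UPDATE Renumbered
SET RRank = RecNumber;   -- closes the gaps left by the deleted patients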
I found some duplicate data as I was going through the logic of a data pump. The entire row is not duplicated, however. I would like to delete only the one row.
This is a sample of the data:

DECLARE @SomeData TABLE (
    FirstName varchar(25),
    MiddleName varchar(25),
    LastName varchar(25),
    StreetAddress varchar(25),
    Suite varchar(25),
    City varchar(25),
    [State] varchar(25),
    PostalCode varchar(10)
[code]...
As you can see, Joe Smith has two rows, but only one of the rows is complete. I would like to delete only the row that has a NULL value in the phone and area code for Joe Smith. There are a few thousand rows like this; they are duplicates in everything but the area code and phone number. I am used to using a CTE to remove duplicates, but I am a little lost on this one. The things that I have tried have not worked exactly as I planned.
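A CTE still works here; the trick is to partition by the columns that do match and order so that the complete row wins. A sketch against the table variable above -- note that the AreaCode and Phone column names are assumptions, since they aren't in the (truncated) declaration:

;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, MiddleName, LastName,
                            StreetAddress, Suite, City, [State], PostalCode
               ORDER BY CASE WHEN AreaCode IS NULL OR Phone IS NULL THEN 1 ELSE 0 END  -- complete row first
           ) AS rn
    FROM @SomeData
)
DELETE FROM ranked
WHERE rn > 1;   -- removes the incomplete copy in each group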
My new employer is CMM Level 3. As part of the CMM/Personal Software Process, I am required to create pseudo code for my stored procedure and UDF design. Has anyone done this? If so, can anyone give me some advice?
I am trying to create a trigger on a table but when I check the syntax it tells me that "The column prefix 'inserted' does not match with a table name or alias used in this query"
CREATE TRIGGER trg_Structural_GenerateBarcode
ON [dbo].[tbStructuralComponentSchedule]
AFTER INSERT
AS
DECLARE @iCount int, @cBarcode char(25), @cCode char(4)
DECLARE @cProject char(7), @cComponent char(10), @iEntryID int
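That error almost always means inserted.<column> was referenced in a statement whose FROM clause doesn't actually include the inserted pseudo-table (or its alias). A minimal sketch of the shape that works -- the column names and the barcode rule here are placeholders, since the rest of the trigger isn't shown:

CREATE TRIGGER trg_Structural_GenerateBarcode
ON [dbo].[tbStructuralComponentSchedule]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE s
    SET s.Barcode = i.ProjectCode + '-' + i.ComponentCode      -- placeholder barcode rule
    FROM dbo.tbStructuralComponentSchedule AS s
    INNER JOIN inserted AS i            -- inserted must appear in the FROM clause before it can be prefixed
            ON i.EntryID = s.EntryID;   -- join on the table's key (EntryID is a placeholder)
END

Written set-based like this, the trigger also handles multi-row inserts, which the one-variable-per-column approach (@cProject, @iEntryID, etc.) does not.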
I'm in the process of trying to identify duplicate contacts. I'm doing this for millions of contacts, have gotten stuck, and could use some elegant solutions!
The business rule is this:
Any contacts that have the same name, phone, and email address are the same contact.
Any contacts that have the same name and email address are the same contact.
Any contacts that have the same name and email address but a different phone are different contacts.
Any contact that has the same name and email address and a blank phone can be the same contact as one that has the same name and email address and does have a phone.
Rank by the DataSource_fk, 1 being the highest.
Put another way:
If 3 contacts have the same name, 2 have phone '1112223344' and all three have the email address 'johndoe@gmail.com' they are the same contact and the lowest DataSource_fk should be ranked the highest.
I've used the Row_number over (Partition by) in the past, but am unsure how to deal with the blanks in email and phone.
DROP TABLE [dbo].[TestBusinessContact];
GO
CREATE TABLE [dbo].[TestBusinessContact] (
    [TestBusinessContact_pk] INT IDENTITY(1,1) NOT NULL,
    [Business_fk] INT NOT NULL CONSTRAINT DF_TestBusinessContact_Business_fk DEFAULT(0),
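One way to make Row_number over (Partition by) cope with the blanks is a two-step: first let a blank/NULL phone "borrow" a phone seen for the same name + email (so the row lands in that contact's group), then partition by name + email + that effective phone and rank by DataSource_fk. A sketch -- the name/email/phone column names are assumptions, since the table definition above is truncated:

;WITH filled AS (
    SELECT *,
           COALESCE(NULLIF(LTRIM(RTRIM(Phone)), ''),                     -- keep a real phone as-is
                    MAX(NULLIF(LTRIM(RTRIM(Phone)), ''))
                        OVER (PARTITION BY FirstName, LastName, Email)) AS EffectivePhone
    FROM dbo.TestBusinessContact
), ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, Email, EffectivePhone
               ORDER BY DataSource_fk) AS rn                             -- lowest DataSource_fk wins
    FROM filled
)
SELECT *
FROM ranked
WHERE rn = 1;   -- the surviving row per contact group

One edge case to decide on: if a name + email combination has several different non-blank phones plus a blank row, the blank row gets assigned to the MAX phone arbitrarily, which may or may not be the desired business rule.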
Yes, I know this subject has been exhausted, but I need help in locating the discussion which took place a few months ago. Sharon relayed to the group a piece of software (expensive) which would help in my particular situation. I grabbed a demo and have gotten the approval for purchase. Unfortunately, I don't have the thread with me at work.
The problem:
Number | Fname | Lname    | Age | ID
123    | John  | Franklin | 43  | 1
123    | Jane  | Franklin | 40  | 2
123    | Jeff  | Franklin | 12  | 3
124    | Jean  | Simmons  | 39  | 4
125    | Gary  | Bender   | 37  | 5
126    | Fred  | Johnson  | 29  | 6
126    | Fred  | Johnson  | 39  | 7
127    | Gene  | Simmons  | 47  | 8
The idea would be to get only unique values in the Number column. I don't care which information I grab from the other columns, but I must have those fields included. If my result set looked as follows, that would be fine -- or any other way, as long as all of the fields had information and there were only unique values in the Number field.
Number | Fname | Lname    | Age | ID
123    | Jeff  | Franklin | 12  | 3
124    | Jean  | Simmons  | 39  | 4
125    | Gary  | Bender   | 37  | 5
126    | Fred  | Johnson  | 39  | 7
127    | Gene  | Simmons  | 47  | 8
If anyone remembers this discussion, mainly the date, I would really appreciate it.
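Setting the old thread aside, the set-based way to get one arbitrary-but-complete row per Number is to number the rows within each Number group and keep the first. A sketch (the table name is a placeholder; change the ORDER BY if a particular row should win):

SELECT Number, Fname, Lname, Age, ID
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Number ORDER BY ID DESC) AS rn
    FROM dbo.People        -- placeholder table name
) AS t
WHERE rn = 1
ORDER BY Number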
I have two tables: one contains all work orders, and the second contains records for work orders that are linked to customer orders. I'm trying to create a query that returns specific fields for the orders in the linked-order table (demand_supply_link), plus only those work orders in the all-orders table (work_order) that do not exist in the linked-order table. I have tried several queries and cannot get the results I desire. Here is the query I am currently trying.
SELECT DISTINCT
    WORK_ORDER.DESIRED_WANT_DATE as 'Want Date',
    DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID as 'WO Id',
    WORK_ORDER.DESIRED_QTY as 'End Qty',
    DEMAND_SUPPLY_LINK.SUPPLY_PART_ID as 'Part Id',
    CUST_ORDER_LINE.CUSTOMER_PART_ID as 'Cust Part',
    OPERATION.RESOURCE_ID as Resource,
    PART.DESCRIPTION as Description,
    CUSTOMER.NAME as Name
FROM ((((DEMAND_SUPPLY_LINK
    INNER JOIN CUST_ORDER_LINE ON DEMAND_SUPPLY_LINK.DEMAND_BASE_ID = CUST_ORDER_LINE.CUST_ORDER_ID)
    INNER JOIN WORK_ORDER ON DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID = WORK_ORDER.BASE_ID)
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID)
    INNER JOIN PART ON WORK_ORDER.PART_ID = PART.ID)
    INNER JOIN (CUSTOMER INNER JOIN CUSTOMER_ORDER ON CUSTOMER.ID = CUSTOMER_ORDER.CUSTOMER_ID)
        ON CUST_ORDER_LINE.CUST_ORDER_ID = CUSTOMER_ORDER.ID
WHERE WORK_ORDER.DESIRED_WANT_DATE Is Not Null
    AND OPERATION.RESOURCE_ID in ('ASSY','FAB 1','PLAY TRK')
    AND WORK_ORDER.STATUS='R'
UNION
SELECT distinct
    work_order.desired_want_date as 'Want Date',
    work_order.BASE_id as 'WO Id',
    work_order.desired_qty as 'End Qty',
    work_order.part_id as 'Part Id',
    operation.resource_id as Resource,
    part.description as Description
FROM WORK_ORDER
    INNER JOIN PART ON PART_ID = WORK_ORDER.PART_ID
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID
WHERE WORK_ORDER.DESIRED_WANT_DATE IS NOT NULL
    AND OPERATION.RESOURCE_ID IN ('ASSY','FAB 1', 'PLAY TRK')
    AND WORK_ORDER.STATUS='R'
This is the error I receive: Server: Msg 205, Level 16, State 1, Line 1 All queries in an SQL statement containing a UNION operator must have an equal number of expressions in their target lists.
The all orders table (work_order) will not have the other fields to link to as there is no customer order linked to them.
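The first SELECT returns 8 columns and the second only 6, and a UNION needs the same count (and compatible types, in the same positions) on both sides. One common fix is to pad the second SELECT with NULL placeholders for the customer-order columns it can't supply, and a NOT EXISTS can be added so that half really does return only the unlinked work orders. A sketch of the second half under those assumptions (note the PART join is written against PART.ID, matching the first SELECT):

UNION
SELECT DISTINCT
    WORK_ORDER.DESIRED_WANT_DATE AS 'Want Date',
    WORK_ORDER.BASE_ID           AS 'WO Id',
    WORK_ORDER.DESIRED_QTY       AS 'End Qty',
    WORK_ORDER.PART_ID           AS 'Part Id',
    NULL                         AS 'Cust Part',    -- placeholder: no customer order line for these
    OPERATION.RESOURCE_ID        AS Resource,
    PART.DESCRIPTION             AS Description,
    NULL                         AS Name            -- placeholder: no customer for these
FROM WORK_ORDER
    INNER JOIN PART ON PART.ID = WORK_ORDER.PART_ID
    INNER JOIN OPERATION ON WORK_ORDER.BASE_ID = OPERATION.WORKORDER_BASE_ID
WHERE WORK_ORDER.DESIRED_WANT_DATE IS NOT NULL
    AND OPERATION.RESOURCE_ID IN ('ASSY', 'FAB 1', 'PLAY TRK')
    AND WORK_ORDER.STATUS = 'R'
    AND NOT EXISTS (SELECT 1
                    FROM DEMAND_SUPPLY_LINK
                    WHERE DEMAND_SUPPLY_LINK.SUPPLY_BASE_ID = WORK_ORDER.BASE_ID)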
Can someone tell me the best procedure when trying to find duplicate records within a table(s)?
I'm new to using SQL Server, and I have been informed that there may be some dups within unknown tables. I need to find these dups.
If someone can tell me how to perform this procedure, I would appreciate it. If you reply, could you also include examples that I could follow, for my records?
Table1 has shop# and shop_id. Every shop# should have only one shop_id. There have been a few data entry errors where a shop# has duplicate shop_ids. How do I write a query for shop#s that have more than one shop_id?
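A plain GROUP BY/HAVING on Table1 finds the offending shop#s, and joining back to the table shows the conflicting rows:

SELECT [shop#]
FROM Table1
GROUP BY [shop#]
HAVING COUNT(DISTINCT shop_id) > 1;

SELECT t.*
FROM Table1 AS t
JOIN (SELECT [shop#]
      FROM Table1
      GROUP BY [shop#]
      HAVING COUNT(DISTINCT shop_id) > 1) AS d
  ON d.[shop#] = t.[shop#]
ORDER BY t.[shop#], t.shop_id;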
Not so sure how simple this question is but here is what happened. I installed SQL Server 2005 on a new Win Server 2003. I exported the tables and their data from the old machine to the newly established database on the new machine.
It looks like all my records were duplicated. When I try to delete one of the duplicates it won't work, because both rows are affected. I can't set my primary key now, and if I try to create a new database with the primary key already set, the import fails.
Any one run into this before or know what's going on?
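Since the rows are exact copies and there is no key to tell them apart, one common fix is to number the rows within groups of identical values and delete all but the first; after that, the primary key can be created and the import rerun cleanly next time. A sketch with placeholder table/column names (list every column of the real table in the PARTITION BY):

;WITH dupes AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Col1, Col2, Col3      -- list ALL of the table's columns here
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM dbo.MyImportedTable                      -- placeholder table name
)
DELETE FROM dupes
WHERE rn > 1;   -- keeps exactly one copy of each row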
Hi, I have written a web application using Dreamweaver MX, ASP.NET, and MS SQL Server 2005.

The problem I am having occurs when I attempt to edit a record. I have set up a DataGrid with free-form fields so that the user can click Edit, make the required changes within the data grid, then click Update; the data is then saved to the database. All this was created using Dreamweaver, and most of the code was automatically generated for me.

The problem is that sometimes (not every time) when I go to edit a record, once I hit the Update button to save the changes, the record is duplicated 1 or more times. This doesn't happen every time, but when it does it duplicates the record between 1 and about 5 times. I have double-checked everything but cannot find anything obvious that may be causing this issue. Does anyone have any suggestions as to what I should look for? Is this a coding error or something wrong with MS SQL? Any ideas?

Thanks in advance,
Mitch
Hi all, how do I avoid duplicate records in my database? I have 4 textboxes that collect user information, and this information is saved in the database. When a user fills in the textboxes and clicks the submit button, I want to check the database to see whether the exact same record already exists before the data is saved. If the user is already registered in the database, he won't be allowed to log in. How can I achieve this? I thought of using the CompareValidator, but I'm not sure how to proceed. Thanks.
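On the SQL side, the usual pattern is a conditional insert guarded by the same four values (backed, ideally, by a UNIQUE constraint so that two simultaneous submits can't both slip through). A minimal sketch -- the table and column names here are placeholders for whatever the four textboxes map to:

IF NOT EXISTS (
    SELECT 1
    FROM dbo.Users
    WHERE UserName = @UserName
      AND Email    = @Email
      AND Phone    = @Phone
      AND PostCode = @PostCode
)
BEGIN
    INSERT INTO dbo.Users (UserName, Email, Phone, PostCode)
    VALUES (@UserName, @Email, @Phone, @PostCode);
END

The application can then report "already registered" whenever the IF NOT EXISTS branch is skipped (e.g. by checking @@ROWCOUNT after the statement).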