T-SQL (SS2K8) :: Identifying Potential Duplicate Records In A Given Table?
Oct 8, 2015
Are there any useful SQL queries that might be used to identify lists of potential duplicate records in a table?
For example, I have a Client database that includes a table dbo.Clients. This table contains various columns which could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode etc. The data contained in these columns is not always exactly the same, due to differences caused by user data entry; some records may have missing data in some of the columns, and there can be spelling differences too. Like the following examples:
1 | Smith | John Raymond | NULL | NI990946B | SW12 8TQ
2 | Smith | John | 06/03/1967 | NULL | SW12 8TQ
3 | Smith | Jon Raymond | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that whilst it is easy for a human being to review these 3 entries and conclude that they are most likely the same client entered into the database 3 times, I cannot find a reliable way of identifying them using a SQL query.
I've considered concatenating the columns into a new column, minus whitespace, and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy logic, maybe?
Ideally the results would produce a grid something like this for the example above:
ID | Surname | Forenames | DuplicateID | DupSurname | DupForenames
1 | Smith | John Raymond | 2 | Smith | John
1 | Smith | John Raymond | 3 | Smith | Jon Raymond
9 | Brown | Peter David | 343 | Brown | Pete D
... next batch of duplicates, etc.
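One hedged starting point (a minimal sketch, not a complete matcher, assuming the dbo.Clients columns above): normalise the volatile columns by stripping spaces from NINumber and PostalCode, block the self-join on postcode so only plausible pairs are compared, and score the names with DIFFERENCE, which compares SOUNDEX codes and returns 0 (no match) to 4 (strong match):

-- Sketch only: assumes dbo.Clients(ID, Surname, Forenames, DateOfBirth, NINumber, PostalCode).
;WITH Normalised AS (
    SELECT  ID, Surname, Forenames, DateOfBirth,
            REPLACE(ISNULL(NINumber, ''), ' ', '')   AS NINo,
            REPLACE(ISNULL(PostalCode, ''), ' ', '') AS PostCode
    FROM    dbo.Clients
)
SELECT  a.ID, a.Surname, a.Forenames,
        b.ID AS DuplicateID, b.Surname AS DupSurname, b.Forenames AS DupForenames
FROM    Normalised AS a
JOIN    Normalised AS b
          ON  b.ID > a.ID                  -- report each pair once
          AND b.PostCode = a.PostCode      -- blocking key keeps the self-join cheap
WHERE   ( DIFFERENCE(a.Surname, b.Surname) = 4
          AND DIFFERENCE(a.Forenames, b.Forenames) >= 3 )
   OR   ( a.NINo <> '' AND a.NINo = b.NINo );   -- an exact NI match trumps name fuzziness

Blocking on postcode misses pairs whose postcodes differ (a second pass blocked on NINo covers some of those), and DIFFERENCE is crude, so treat the output as candidates for human review; SSIS Fuzzy Grouping or a CLR edit-distance function scores far better.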
We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl drop column last_login_date

select * from mytbl order by employee_id, effective_date
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system, taking the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's: she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
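A minimal set-based sketch of the usual fix, assuming mytbl(employee_id, employee_name, employee_dept, effective_date, expiration_date) after the unwanted columns are dropped: this is the classic gaps-and-islands pattern. The difference of two ROW_NUMBERs is constant within each run of consecutive rows whose remaining values are identical (so Cheryl's return to dept 6 starts a new island), and each island then collapses to its MIN/MAX dates:

-- Sketch: collapse consecutive "effective duplicates" per employee.
;WITH Numbered AS (
    SELECT  employee_id, employee_name, employee_dept,
            effective_date, expiration_date,
            ROW_NUMBER() OVER (PARTITION BY employee_id
                               ORDER BY effective_date)
          - ROW_NUMBER() OVER (PARTITION BY employee_id, employee_name, employee_dept
                               ORDER BY effective_date) AS island
    FROM    mytbl
)
SELECT  employee_id, employee_name, employee_dept,
        MIN(effective_date)  AS effective_date,
        MAX(expiration_date) AS expiration_date
INTO    mytbl_merged          -- materialise, then swap/rename into place
FROM    Numbered
GROUP BY employee_id, employee_name, employee_dept, island;

Rebuilding into a new table and swapping it in is usually far cheaper at this scale than millions of in-place deletes and updates, since every page would be touched anyway.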
I'm trying to delete duplicate records from the output of the query below; if they also meet certain conditions (e.g. a different address type) then I would merge the records instead. Given the following query, how do I go about achieving one and/or the other, either from the output or as an extension of the query itself?
There is a report to identify potential duplicates in a table and it is performing poorly. I'm now tuning the existing stored procedure and got stuck modifying it; I need to rewrite the query in a better way. I've pasted below an example of the query which is currently in the report. The report runs every week. Currently the table has 10 million records, and every week 5k to 10k rows are added, so those 5k to 10k new rows have to be checked against all 10 million rows. The duplicate logic is (surname = surname OR forename = forename OR DOB = DOB).
Create table #employee (
    ID int,
    empid varchar(100),
    surname varchar(100),
    forename varchar(100),
    DOB datetime,
    empregistereddate datetime,
    Createdate datetime
)
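One hedged rewrite (a sketch, assuming Createdate marks when a row was loaded and that surname, forename and DOB each have an index): compare only the week's new rows against the full table, and split the ORed predicates into separate probes, since OR across different columns usually defeats index seeks. UNION stitches the three seek-friendly branches back together:

-- Sketch: check only rows loaded in the last week against the whole table.
DECLARE @cutoff datetime = DATEADD(DAY, -7, GETDATE());

SELECT n.ID, n.empid, n.surname, n.forename, n.DOB
FROM   #employee AS n
WHERE  n.Createdate >= @cutoff
  AND  EXISTS (SELECT 1 FROM #employee AS e
               WHERE e.surname = n.surname AND e.ID <> n.ID)
UNION
SELECT n.ID, n.empid, n.surname, n.forename, n.DOB
FROM   #employee AS n
WHERE  n.Createdate >= @cutoff
  AND  EXISTS (SELECT 1 FROM #employee AS e
               WHERE e.forename = n.forename AND e.ID <> n.ID)
UNION
SELECT n.ID, n.empid, n.surname, n.forename, n.DOB
FROM   #employee AS n
WHERE  n.Createdate >= @cutoff
  AND  EXISTS (SELECT 1 FROM #employee AS e
               WHERE e.DOB = n.DOB AND e.ID <> n.ID);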
Hi everybody, I'm migrating a table that has over 20,000 records and a lot of duplication. Let's say an Employee table with multiple records having slight differences in the EmployeeName field. Nobody would want to sit and manually identify them among such a huge number of records. Is there any way which would help me identify most of them and reduce the redundancy?
I have a table with about half a million records, each representing a patient in my county.
Each record has a field (RRank) which basically sorts the patients as to how "unwell" they are according to a previously-applied algorithm. The most unwell patient has an RRank of 1, the next-most unwell has RRank=2 etc.
I have just deleted several hundred records (which relate to patients now deceased) from the table, thereby leaving gaps in the RRank sequence. I want to renumber the remaining recs to get rid of the gaps.
I can see what I want to accomplish by using ROW_NUMBER, thus:
SELECT ROW_NUMBER() Over (ORDER BY RRank) as RecNumber, RRank FROM RPL ORDER BY RRank
I see the numbers in the RecNumber column falling behind the RRank as I scan down the results.
My question is: How to convert this into an UPDATE statement? I had hoped that I could do something like:
UPDATE RISC_PatientList_TEMP SET RRank = ROW_NUMBER() Over (ORDER BY RRank);
but the system informs me that windowed functions can only appear in the SELECT or ORDER BY clauses (and an UPDATE isn't a SELECT, and I can't legally add an ORDER BY).
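The standard workaround is to compute ROW_NUMBER in a common table expression and update through the CTE, which SQL Server does allow. A minimal sketch against the table from the post:

-- Sketch: renumber RRank without gaps by updating through a CTE.
;WITH Renumbered AS (
    SELECT RRank,
           ROW_NUMBER() OVER (ORDER BY RRank) AS RecNumber
    FROM   RISC_PatientList_TEMP
)
UPDATE Renumbered
SET    RRank = RecNumber;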
How can we write a query to achieve this while keeping in mind:
- We do not want to do unnecessary record lookups and updates.
- We only update records that correspond to new entries. For example, we should not touch the record for InvoiceID = 2 in the above example.
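The original example table isn't shown here, but a minimal sketch of the usual pattern, using hypothetical Invoice and Staging tables, is to drive the UPDATE from only the new keys so untouched rows are never read or written:

-- Sketch with hypothetical tables: Staging holds only the newly arrived InvoiceIDs.
UPDATE i
SET    i.Status = 'Processed'             -- placeholder column/value
FROM   dbo.Invoice AS i
JOIN   dbo.Staging AS s
         ON s.InvoiceID = i.InvoiceID     -- the join limits the work to new entries
WHERE  i.Status <> 'Processed';           -- and already-correct rows are skipped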
How do I remove duplicate records from a table with a single query, without using cursors or anything like that? Sample: a table temp with one column Col1 containing the values 1, 2, 2, 1. P.S. The table has only one column.
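A minimal sketch, assuming the single-column table temp(Col1) above: with no key to tell the copies apart, number them with ROW_NUMBER in a CTE and delete through the CTE:

-- Sketch: keep one row per value in temp(Col1), delete the rest.
;WITH Numbered AS (
    SELECT Col1,
           ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS rn
    FROM   temp
)
DELETE FROM Numbered
WHERE  rn > 1;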
I am performing analysis of linked servers across 2000-2008R2 and need to find/build a list of linked servers that are truly active. For the sake of the post, let's define 'active' as having executed a distributed query in the last 5 days.
I have been scanning the DMVs without much success. Perhaps I must look more closely at MSDTC?
The end result would be to cleanup 300+ linked servers across 40+ SQL Servers.
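No DMV directly records linked-server usage, so any query-based answer is approximate. One hedged sketch for the 2005+ instances is to scan the plan cache for four-part names that reference each linked server; it only sees plans still in cache (and the scan itself is not cheap), so it understates activity, and on the 2000 boxes a Profiler trace is about the only option:

-- Sketch: cached plans whose text references a linked server by name.
SELECT      s.name AS linked_server,
            MAX(qs.last_execution_time) AS last_seen
FROM        sys.servers AS s
CROSS JOIN  sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE       s.is_linked = 1
  AND       st.text LIKE '%' + s.name + '.%'   -- four-part name, e.g. SRV.db.schema.table
GROUP BY    s.name;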
Hello friends, I have a problem: some records in my table are duplicated and I want to find out which ones. For example, my table is as follows:

emp_id emp_name
1 aa
2 bb
3 cc
1 aa
3 cc
3 cc

and I want the result to be:

emp_id emp_name
1 aa
1 aa
3 cc
3 cc
3 cc
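A minimal sketch for returning every row that belongs to a duplicated pair, assuming the emp table above:

-- Sketch: all rows whose (emp_id, emp_name) pair occurs more than once.
SELECT e.emp_id, e.emp_name
FROM   emp AS e
JOIN  (SELECT emp_id, emp_name
       FROM   emp
       GROUP BY emp_id, emp_name
       HAVING COUNT(*) > 1) AS d
         ON d.emp_id = e.emp_id AND d.emp_name = e.emp_name
ORDER BY e.emp_id;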
I have a table that has duplicate records with the exception of the ID, and I am trying to write a while loop that goes through the table and locates all duplicate records based on a field called LASTNAME.
I just discovered that all my records appear twice inside my table; in other words, each one repeats on the row below. How can I delete all of the duplicates? I'm sure there must be a tidy line of SQL to do that. Thanks, Bill
I uploaded some data 2 or 3 times and it kept appending to the table. Now I want to keep only the first copy of each duplicate and delete the rest. Suppose part number 123 has been added 3 times; I want to keep only 1 record. Thanks
I loaded one table via SSIS and found that it contained many duplicate records (from the input source). I can create a SQL task to delete them, but I wonder if SSIS offers a task "out of the box" to delete dups?
Hello friends... Can anybody answer this question: how do you delete duplicate records from a table? I know that with the check option and also with a unique constraint we can avoid entering duplicate records into a table, but how do you delete them from a table which does not have any constraints?
I am trying to test some data handling between two different versions of an application.
I have restored the database schema twice, once as DB_old and once as DB_new.
I import a transaction using the new application into DB_new and I import the SAME transaction into the DB_old using the old version of application.
I then have to eyeball the data in SQL Query Analyzer to try to identify problems where the fields have received different values.
I have done this by running a select statement twice telling it to use both of the databases and then viewing it in two grids. There are a lot of columns so I have to do a lot of scrolling across the screen to do the comparison, and since the view is in two separate grids I have to hop back and forth and click the scroll bars, etc.
It seems like there has to be a better way. I don't suppose there is a way to lock the two grids so they both scroll together is there?
I was thinking maybe I could insert each of the selects into a temporary table and then do some kind of comparison to identify which values were different in each column. Some of the columns will have differences, like the timestamp, but if I could somehow identify which columns were different then I could eyeball them to identify which of those were okay to be different and which of them were actually bugs from the changed application version.
I have no idea how to identify those individual columns with different data values or even where to start.
Just so you understand better what I am doing now, here is the query I am running that I then eyeball:

use DB_new
select * from claim where claim_id = 35144

use DB_old
select * from claim where claim_id = 35144
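One hedged alternative to eyeballing, assuming both restored copies sit on the same instance: EXCEPT returns only the rows that differ in any column, and unpivoting each version into (column, value) pairs narrows that down to the individual mismatching columns. A minimal sketch, where status and payee are placeholders for whichever claim columns you care about:

-- Step 1: does anything differ at all?
SELECT * FROM DB_new.dbo.claim WHERE claim_id = 35144
EXCEPT
SELECT * FROM DB_old.dbo.claim WHERE claim_id = 35144;

-- Step 2 sketch: which columns differ (extend the VALUES lists to more columns).
SELECT n.col, n.val AS new_val, o.val AS old_val
FROM  (SELECT c.claim_id, v.col, v.val
       FROM DB_new.dbo.claim AS c
       CROSS APPLY (VALUES ('status', CONVERT(varchar(100), c.status)),
                           ('payee',  CONVERT(varchar(100), c.payee))) AS v(col, val)
       WHERE c.claim_id = 35144) AS n
JOIN  (SELECT c.claim_id, v.col, v.val
       FROM DB_old.dbo.claim AS c
       CROSS APPLY (VALUES ('status', CONVERT(varchar(100), c.status)),
                           ('payee',  CONVERT(varchar(100), c.payee))) AS v(col, val)
       WHERE c.claim_id = 35144) AS o
        ON o.claim_id = n.claim_id AND o.col = n.col
WHERE ISNULL(n.val, '') <> ISNULL(o.val, '');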
I need to query to return a result for each unique machine with the latest date. The example result below would be returned because they have the latest date.
MachineA 5/7/2011
MachineB 5/5/2010
Select Distinct would almost do it, but I need each unique machine that has the latest date.
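A minimal sketch, assuming a hypothetical MachineLog(MachineName, EventDate) table:

-- Sketch: latest date per machine.
SELECT   MachineName, MAX(EventDate) AS LatestDate
FROM     dbo.MachineLog
GROUP BY MachineName;

If you also need the other columns from that latest row, the usual pattern is ROW_NUMBER() OVER (PARTITION BY MachineName ORDER BY EventDate DESC) in a derived table, filtered to 1.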
I have this query below that I created to do a count, but I don't think this is what I needed.
I need to find the duplicates. For example, if CLI_ID1 12345 has 4 CLIP records, each CLIP record should have a different CLIP rank. I need to find scenarios where 2 (or more) of the CLIP records have the same CLIP_RANK, i.e. where there are duplicate CLIP_RANKs within the same CLI_ID.
Select Distinct cli_id1, count(clip_rank) AS countrank
FROM impact.dbo.CLI
LEFT JOIN impact.dbo.CLIO ON CLI.CLI_ID1 = CLIO.clio_id1
LEFT JOIN impact.dbo.clip ON cli_id1 = clip_id1
WHERE (clio_trm = '' OR clio_trm IS NULL)   -- "clio_trm = NULL" never matches; IS NULL is the correct test
GROUP BY cli_id1
ORDER BY cli_id1
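The count query above isn't needed for the duplicate check itself; a hedged sketch of the direct test, assuming the rank lives in impact.dbo.clip(clip_id1, clip_rank), groups by rank within each client and keeps groups with more than one row:

-- Sketch: CLI_ID1 values having two or more CLIP rows with the same CLIP_RANK.
SELECT   clip_id1 AS cli_id1, clip_rank, COUNT(*) AS dup_count
FROM     impact.dbo.clip
GROUP BY clip_id1, clip_rank
HAVING   COUNT(*) > 1
ORDER BY clip_id1, clip_rank;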
Now I want the records having flag2 = 1 only, i.e. ID = 3 has only flag2 = 1, whereas IDs 1 and 2 have flag1 and flag3 = 1 along with flag2 = 1. I don't want IDs 1 and 2.
I can't make ID unique or a primary key. I tried CASE WHEN statements, but I am somehow missing the basic logic.
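The table layout isn't shown, but since ID can repeat, a hedged sketch assuming a hypothetical Flags(ID, flag1, flag2, flag3) with possibly several rows per ID: aggregate per ID and keep only the IDs where flag2 is set and flag1/flag3 never are:

-- Sketch: IDs whose only set flag, across all their rows, is flag2.
SELECT   ID
FROM     dbo.Flags
GROUP BY ID
HAVING   MAX(CASE WHEN flag2 = 1 THEN 1 ELSE 0 END) = 1   -- flag2 set somewhere
   AND   MAX(CASE WHEN flag1 = 1 THEN 1 ELSE 0 END) = 0   -- flag1 never set
   AND   MAX(CASE WHEN flag3 = 1 THEN 1 ELSE 0 END) = 0;  -- flag3 never set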
I have a series of records based on empid where I want to identify the empids that may have discrepancies. Some empids are listed more than once and have different DOBs. In the example I am trying to create a DOB_ERROR column and say 'Yes' if the DOB doesn't match the other records in the file with the same empid.
SELECT Empid, DOB,
       CASE WHEN DOB = DOB THEN 'No' ELSE 'Yes' END AS DOB_ERROR,
       City, St, Gender
FROM Emp
WHERE EMPID IN
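As written, CASE WHEN DOB = DOB compares each row to itself, so it always returns 'No'. The check has to look across the other rows sharing the empid; a minimal sketch using windowed MIN/MAX (supported on SS2K8, and avoids a self-join):

-- Sketch: flag empids whose rows don't all share the same DOB.
SELECT Empid, DOB,
       CASE WHEN MIN(DOB) OVER (PARTITION BY Empid)
              <> MAX(DOB) OVER (PARTITION BY Empid)
            THEN 'Yes' ELSE 'No'
       END AS DOB_ERROR,
       City, St, Gender
FROM   dbo.Emp;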
I have 2 tables in a 1:n relation. How can I get a select statement that outputs the fields from the n-side separated by semicolons? Example: one person has many job titles.
Table1 (tblPerson), Table2 (tblTitles); desired output, in one select statement:

1, "John", "Miller", "Employee; Admin; Consultant"
2, "Joan", "Stevens", "Employee; Software Engineer; Consultant"
and so on ...
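SS2K8 has no STRING_AGG, so the usual workaround is FOR XML PATH plus STUFF. A minimal sketch, assuming hypothetical tblPerson(PersonID, FirstName, LastName) and tblTitles(PersonID, Title):

-- Sketch: concatenate each person's titles into one semicolon-separated string.
SELECT p.PersonID, p.FirstName, p.LastName,
       STUFF((SELECT '; ' + t.Title
              FROM   tblTitles AS t
              WHERE  t.PersonID = p.PersonID
              FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'),
             1, 2, '') AS Titles              -- STUFF strips the leading '; '
FROM   tblPerson AS p;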
I have a table that I need to do some computations on all the data, but first I need to remove the duplicate records and insert the results into a destination table. Here's the example below. My table has 3.1 million rows. I have tried using DISTINCT and GROUP BY, but both ways of selecting the data take about half a minute to run. I'm wondering if there is a way to increase performance. Users are OK with this time since the process runs overnight, but improving it won't hurt. I do have a clustered index on these fields, but that doesn't seem to improve anything.
I have around 3 tables holding around 20 to 30 GB of data. Table A is related to table B by a FK, and likewise table B is related to table C by a FK. I would like to delete all rows satisfying a certain condition from table A, and all corresponding related records from tables B and C. I have created a query to delete the grandchild rows first, followed by the child table and finally the parent. I have used an inner join in my delete queries. As you all know, inner-join delete operations are extremely resource-intensive, especially on bigger tables.
What is the best approach to delete all these rows? There are many constraints, triggers on these tables. Also, there might be some FK relations to other tables as well.
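A hedged sketch of the usual mitigation, assuming a hypothetical schema A(id), B(id, a_id), C(id, b_id): capture the parent keys once, then delete in small keyed batches, grandchild first, so each transaction keeps locks and log growth manageable:

-- Sketch: batched, child-first deletes driven by the parent keys to remove.
SELECT a.id
INTO   #doomed
FROM   dbo.A AS a
WHERE  a.SomeCondition = 1;          -- placeholder for the real condition

DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) c
    FROM   dbo.C AS c
    JOIN   dbo.B AS b   ON b.id = c.b_id
    JOIN   #doomed AS d ON d.id = b.a_id;
    SET @rows = @@ROWCOUNT;
END;
-- Repeat the same loop for B (joined to #doomed), then for A itself.

Batching this way is still a loop, but each iteration is a set-based delete; it trades one massive transaction for many small ones, which is usually the safer bet on tables with triggers and extra FKs.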