T-SQL (SS2K8) :: Identifying Potential Duplicate Records In A Given Table?
Oct 8, 2015
Are there any useful SQL queries that can be used to identify lists of potential duplicate records in a table?
For example, I have a client database that includes a table dbo.Clients. This table contains various columns that could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode, etc. The data in these columns is not always exactly the same because of differences in user data entry, so some records may be missing data in some of the columns and there could be spelling differences too, as in the following examples:
1 | Smith | John Raymond | NULL | NI990946B | SW12 8TQ
2 | Smith | John | 06/03/1967 | NULL | SW12 8TQ
3 | Smith | Jon Raymond | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that while it is easy for a human being to review these 3 entries and conclude that they are most likely the same client entered into the database 3 times, I cannot find a reliable way of identifying them using a SQL query.
I've considered concatenating the columns into a new column, stripping the white space, and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy logic, maybe?
Ideally the results would produce a grid something like this for the example above:
ID | Surname | Forenames | DuplicateID | DupSurname | DupForenames
1 | Smith | John Raymond | 2 | Smith | John
1 | Smith | John Raymond | 3 | Smith | Jon Raymond
9 | Brown | Peter David | 343 | Brown | Pete D
next batch of duplicates etc etc . . . .
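One possible starting point, sketched here rather than offered as a definitive answer, is a self-join that scores each pair of rows on how many identifying columns agree after light normalisation. The ID column name is taken from the desired grid above; the threshold of 2 matching attributes and the use of the built-in DIFFERENCE() function (a SOUNDEX comparison returning 0 to 4) are assumptions that would need tuning against real data.

```sql
-- Sketch: pair up rows with the same surname and count how many other
-- attributes agree. NULLs never count as a match because any comparison
-- with NULL falls through to the ELSE branch.
SELECT  a.ID, a.Surname, a.Forenames,
        b.ID        AS DuplicateID,
        b.Surname   AS DupSurname,
        b.Forenames AS DupForenames
FROM    dbo.Clients AS a
JOIN    dbo.Clients AS b
        ON  a.ID < b.ID                -- report each pair only once
        AND a.Surname = b.Surname
WHERE   ( CASE WHEN a.DateOfBirth = b.DateOfBirth THEN 1 ELSE 0 END
        + CASE WHEN REPLACE(a.NINumber, ' ', '') = REPLACE(b.NINumber, ' ', '') THEN 1 ELSE 0 END
        + CASE WHEN REPLACE(a.PostalCode, ' ', '') = REPLACE(b.PostalCode, ' ', '') THEN 1 ELSE 0 END
        + CASE WHEN DIFFERENCE(a.Forenames, b.Forenames) >= 3 THEN 1 ELSE 0 END
        ) >= 2                         -- assumed threshold, tune as needed
ORDER BY a.ID, b.ID;
```

Requiring an exact surname match keeps the join cheap but will miss misspelt surnames; relaxing it to a SOUNDEX match is possible at the cost of more candidate pairs to review by hand.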
Apr 21, 2014
We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
drop table mytbl
create table mytbl (
effective_date smalldatetime,
expiration_date smalldatetime,
employee_id int,
employee_name varchar(30),
[code]....
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl
drop column last_login_date
select *
from mytbl
order by employee_id, effective_date
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
create table mytbl2 (
effective_date smalldatetime,
expiration_date smalldatetime,
employee_id int,
employee_name varchar(30),
employee_dept int
[code]...
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system, and taking the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
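A set-based way to handle cases like Cheryl's is the usual "gaps and islands" pattern: number the rows per employee, number them again per employee and attribute combination, and group on the difference. The sketch below assumes last_login_date has already been dropped, that each employee's history is contiguous (no gaps between one expiration_date and the next effective_date), and that mytbl_merged is a hypothetical target table name.

```sql
-- Sketch: consecutive rows with identical attributes share the same
-- "island" value, so a move away and back again (dept 6 -> 9 -> 6)
-- yields three separate groups rather than two.
WITH numbered AS (
    SELECT  effective_date, expiration_date,
            employee_id, employee_name, employee_dept,
            ROW_NUMBER() OVER (PARTITION BY employee_id
                               ORDER BY effective_date)
          - ROW_NUMBER() OVER (PARTITION BY employee_id, employee_name, employee_dept
                               ORDER BY effective_date) AS island
    FROM    mytbl
)
SELECT  MIN(effective_date)  AS effective_date,
        MAX(expiration_date) AS expiration_date,
        employee_id, employee_name, employee_dept
INTO    mytbl_merged          -- hypothetical name for the merged copy
FROM    numbered
GROUP BY employee_id, employee_name, employee_dept, island;
```

Writing the merged rows to a new table and swapping it in afterwards is likely to be cheaper on very large tables than updating and deleting in place, though that depends on logging and index setup.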
Jan 3, 2015
How can I rewrite the two queries below so that I avoid duplicates? I need to send email to everyone, but not to the duplicated addresses.
Create table #MyPhoneList
(
AccountID int,
EmailWork varchar(50),
EmailHome varchar(50),
EmailOther varchar(50),
[Code] ....
--> In this table AccountID is unique
--> Email values could be null or repeated across Work / Home / Other (the same email can be used in more than one column for an AccountID)
-- A new column named SourceFlag will be created (its value is Work, Home, or Other, depending on which column the email comes from), and then duplicates are removed
SELECT AccountID , Email, SourceFlag, ROW_NUMBER() OVER(PARTITION BY AccountID, Email ORDER BY Sourceflag desc) AS ROW
INTO #List
from (
SELECT AccountID
, EmailWork AS EMAIL
, 'Work' AS SourceFlag
FROM #MyPhoneList (NoLock) eml
WHERE IsOffersToWorkEmail = 1
[code]....
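As a rough sketch of one way to collapse the three email columns into a single de-duplicated list, the columns can be unpivoted with CROSS APPLY (VALUES ...) and numbered per address. This leaves out the IsOffers... opt-in filters because only one of them is visible in the truncated query, and it assumes one message per distinct address is wanted; keep AccountID in the PARTITION BY if uniqueness per account is the goal instead.

```sql
-- Sketch: one row per (account, email column), NULL emails dropped,
-- then keep a single row per distinct email address.
WITH unpivoted AS (
    SELECT  p.AccountID, e.Email, e.SourceFlag
    FROM    #MyPhoneList AS p
    CROSS APPLY (VALUES (p.EmailWork,  'Work'),
                        (p.EmailHome,  'Home'),
                        (p.EmailOther, 'Other')) AS e (Email, SourceFlag)
    WHERE   e.Email IS NOT NULL
),
ranked AS (
    SELECT  AccountID, Email, SourceFlag,
            ROW_NUMBER() OVER (PARTITION BY Email
                               ORDER BY AccountID, SourceFlag DESC) AS rn
    FROM    unpivoted
)
SELECT  AccountID, Email, SourceFlag
FROM    ranked
WHERE   rn = 1;
```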
Oct 21, 2014
I'm trying to delete duplicate records from the output of the query below; if they also meet certain conditions, i.e. 'different address type', then I would merge the records instead. Starting from the following query, how do I go about achieving one and/or the other, either from the output or as an extension of the query itself?
SELECT
a1z103acno AccountNumber
, a1z103frnm FirstName
, a1z103lanm LastName
, a1z103ornm OrgName
, a3z103adr1 AddressLine1
, A3z103city City
, A3z103st State
[code]...
Apr 22, 2015
There is a report that identifies potential duplicates in a table, and it is performing poorly. I'm now tuning the existing SP and got stuck modifying it. How can I rewrite the query in the best way? Below is an example of the query that is currently in the report. The report is run every week. The table currently has 10 million records, and every week 5k to 10k are added, so those 5k to 10k new rows have to be checked against all 10 million rows for duplication. The logic is (surname = surname or forename = forename or DOB = DOB).
Create table #employee
(
ID int,
empid varchar(100),
surname varchar(100),
forename varchar(100),
DOB datetime,
empregistereddate datetime,
Createdate datetime
[code]...
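A common way to speed up an OR-based match like this is to split it into one join per column and UNION the results, so that each branch can use its own index. The sketch below is only an outline under assumptions: that new rows can be recognised by Createdate falling within the last 7 days, and that separate indexes exist (or can be created) on surname, forename, and DOB.

```sql
-- Sketch: compare only this week's rows against the full table,
-- one equality join per column instead of a single OR predicate.
DECLARE @since datetime = DATEADD(DAY, -7, GETDATE());  -- assumed cut-off

SELECT  n.ID AS NewRowID, e.ID AS ExistingRowID, 'surname' AS MatchedOn
FROM    #employee AS n
JOIN    #employee AS e ON e.surname = n.surname AND e.ID <> n.ID
WHERE   n.Createdate >= @since
UNION
SELECT  n.ID, e.ID, 'forename'
FROM    #employee AS n
JOIN    #employee AS e ON e.forename = n.forename AND e.ID <> n.ID
WHERE   n.Createdate >= @since
UNION
SELECT  n.ID, e.ID, 'DOB'
FROM    #employee AS n
JOIN    #employee AS e ON e.DOB = n.DOB AND e.ID <> n.ID
WHERE   n.Createdate >= @since;
```

Matching on a single column such as surname alone will return a very large number of candidate pairs on 10 million rows, so tightening the rule (for example, requiring two of the three columns to agree) may matter more for run time than the query shape.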
Mar 9, 2001
Hi everybody,
I'm migrating a table that has over 20,000 records and a lot of duplication. Let's say it is an Employee table with multiple records that differ only slightly in the EmployeeName field. Nobody would like to sit and manually identify them with such a huge number of records.
Is there any way to identify most of them and reduce the redundancy?
Thanx
Aby...
Dec 3, 2014
I have a table with about half a million records, each representing a patient in my county.
Each record has a field (RRank) which basically sorts the patients as to how "unwell" they are according to a previously-applied algorithm. The most unwell patient has an RRank of 1, the next-most unwell has RRank=2 etc.
I have just deleted several hundred records (which relate to patients now deceased) from the table, thereby leaving gaps in the RRank sequence. I want to renumber the remaining recs to get rid of the gaps.
I can see what I want to accomplish by using ROW_NUMBER, thus:
SELECT ROW_NUMBER() Over (ORDER BY RRank) as RecNumber, RRank
FROM RPL
ORDER BY RRank
I see the numbers in the RecNumber column falling behind the RRank as I scan down the results
My question is: How to convert this into an UPDATE statement? I had hoped that I could do something like:
UPDATE RISC_PatientList_TEMP
SET RRank = ROW_NUMBER() Over (ORDER BY RRank);
but the system informs that window functions will only work on SELECT (which UPDATE isn't) or ORDER BY (which I can't legally add).
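One way around this restriction, shown here as a minimal sketch, is to compute the row number inside a common table expression and update through it, since T-SQL allows an UPDATE to target a CTE that selects from a single base table. The table name follows the UPDATE attempt above; NewRank is a hypothetical alias.

```sql
-- Sketch: the window function lives in the SELECT of the CTE,
-- and the UPDATE writes back through the CTE to the base table.
WITH numbered AS (
    SELECT  RRank,
            ROW_NUMBER() OVER (ORDER BY RRank) AS NewRank
    FROM    RISC_PatientList_TEMP
)
UPDATE  numbered
SET     RRank = NewRank
WHERE   RRank <> NewRank;   -- skip rows whose rank is already correct
```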
Mar 6, 2014
I have a snapshot table of about 15 million records in the form of:
InvoiceID LineItemID SnapshotDate Amount
1 1 20140101 12
1 2 20140102 14
1 3 20140103 17
2 1 20140101 10
2 2 20140102 5
1 2 20140105 15
1 3 20140105 20
I want to create an additional column called Current as shown below:
InvoiceID LineItemID SnapshotDate Amount Current
1 1 20140101 12 1
1 2 20140102 14 0
1 3 20140103 17 0
2 1 20140101 10 1
2 2 20140102 5 1
1 2 20140105 15 1
1 3 20140105 20 1
How can we write a query to achieve this while keeping in mind:
- We do not want to do unnecessary record lookups and Updates
- We only update records that correspond to new entries. For example, we should not touch the records for InvoiceID = 2 in the above example
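Reading the sample, a row appears to be current when it carries the latest SnapshotDate for its InvoiceID and LineItemID. Under that assumption, a sketch of the update could look like the following; dbo.InvoiceSnapshot is a hypothetical table name, and [Current] is bracketed because CURRENT is a reserved word.

```sql
-- Sketch: flag the newest snapshot per (InvoiceID, LineItemID) and
-- write only to rows whose flag would actually change.
WITH ranked AS (
    SELECT  [Current],
            CASE WHEN ROW_NUMBER() OVER (PARTITION BY InvoiceID, LineItemID
                                         ORDER BY SnapshotDate DESC) = 1
                 THEN 1 ELSE 0 END AS ShouldBeCurrent
    FROM    dbo.InvoiceSnapshot
)
UPDATE  ranked
SET     [Current] = ShouldBeCurrent
WHERE   [Current] IS NULL
   OR   [Current] <> ShouldBeCurrent;
```

The WHERE clause avoids unnecessary writes, but the query still reads every row; restricting the CTE to invoice lines that received a new snapshot would be needed to avoid the full scan as well.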
Oct 5, 2007
How do I remove duplicate records from a table with a single query, without using cursors or anything like that? Sample: tempCol11221. P.S. The table has only one column.
Aug 27, 2014
I am performing analysis of linked servers across 2000-2008R2 and need to find/build a list of linked servers that are truly active. For the sake of the post, let's define 'active' as having executed a distributed query in the last 5 days.
I have been scanning the DMVs without much success. Perhaps I must look more closely at MSDTC?
The end result would be to cleanup 300+ linked servers across 40+ SQL Servers.
Oct 5, 2007
Hello friends,
I have a problem: I have a table in which some records are duplicates. I want to find which records are duplicated.
For example, my table is as follows:
emp_id emp_name
1 aa
2 bb
3 cc
1 aa
3 cc
3 cc
and I want the result to look like this:
emp_id emp_name
1 aa
1 aa
3 cc
3 cc
3 cc
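A minimal sketch of one way to list every copy of the duplicated rows is to find the duplicated combinations with GROUP BY ... HAVING and join back to the table. The table name emp is an assumption, since the post does not give one.

```sql
-- Sketch: the derived table holds each (emp_id, emp_name) that occurs
-- more than once; joining back returns all of its copies.
SELECT  e.emp_id, e.emp_name
FROM    emp AS e
JOIN    ( SELECT emp_id, emp_name
          FROM   emp
          GROUP BY emp_id, emp_name
          HAVING COUNT(*) > 1 ) AS d
        ON  d.emp_id   = e.emp_id
        AND d.emp_name = e.emp_name
ORDER BY e.emp_id;
```

Rows where emp_name is NULL would not be matched by the equality join, so they would need separate handling if they can occur.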
Nov 29, 2004
Hello All,
Please show me how to delete duplicate records from a table.
Thanks in Advance
Feb 9, 2012
I have a table that has duplicate records with the exception of the ID and I am trying to write a while loop that would go through the table, locate all duplicate records based on a field called LASTNAME.
May 16, 2008
Hello,
I need some help on this.
I want to retrieve all the duplicate records for a particular column.
For example, suppose I have a table named testtable.
Columns in the table: item_id, ref_no, title, address
Now I need to check whether there are any duplicate entries in the ref_no column and, if there are, retrieve those records.
Gaurish Salunke
Software Developer
OPSPL
Jul 20, 2005
I just discovered that all my records appear twice inside my table; in other words, they repeat on the row below. How can I delete all of the duplicates? I'm sure there must be a tidy line of SQL to do that. Thanks, Bill
Jul 20, 2005
I uploaded some data about 2 or 3 times and it kept appending it to the table. Now I want to keep only the first copy of each duplicate and delete the rest. Suppose part number 123 has been added 3 times; I want to keep only 1 record. Thanks
Apr 3, 2008
Hi Guyz
say i have a table
10011 NULL NULL Classical NULL
10011 NULL NULL Classical NULL
10004 NULL NULL Classical NULL
10004 NULL NULL Classical NULL
10004 NULL NULL Classical NULL
10005 NULL NULL Classical NULL
I want to eliminate the duplicate records so that the table looks like
10011 NULL NULL Classical NULL
10004 NULL NULL Classical NULL
10005 NULL NULL Classical NULL
Is there any simple SQL to do this, or does it need something complex?
thanks in advance !
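When the rows are identical in every column, one fairly simple option on SQL Server 2005 or later is to number the copies with ROW_NUMBER() and delete through a CTE. This is only a sketch: dbo.MyTable and Col1 to Col5 are placeholder names, because the post does not name the table or its columns.

```sql
-- Sketch: every row gets a sequence number within its group of
-- identical rows; everything after the first copy is deleted.
-- PARTITION BY treats NULLs as equal, so the NULL columns group together.
WITH numbered AS (
    SELECT  ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3, Col4, Col5
                               ORDER BY (SELECT NULL)) AS rn
    FROM    dbo.MyTable
)
DELETE FROM numbered
WHERE   rn > 1;
```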
Jul 6, 2006
I loaded one table via SSIS and found that it contained many duplicate records (from the input source). I can create a SQL task to delete them, but I wonder if SSIS offers a task "out of the box" to delete dups?
TAI,
barkingdog
Mar 4, 2008
Hello friends... Can anybody answer this question: how do I delete duplicate records from a table?
I know that with a check option or a unique constraint we can avoid entering duplicate records in a table, but how do I delete them from a table which does not have any constraints?
Nov 15, 2015
I have this table:
id | Name | Age
==================
1 | AAA | 22
1 | AAA | 22
2 | BBB | 33
2 | BBB | 33
2 | BBB | 33
3 | CCC | 44
4 | DDD | 55
I need to delete all the duplicate records from this table and leave only one record of each. The table will look like this:
id | Name | Age
==================
1 | AAA | 22
2 | BBB | 33
3 | CCC | 44
4 | DDD | 55
I work with sqlCE for Mobile...
Mar 17, 2014
This seems simple enough but for some reason, my brain isn't working.
I have a lookup table:
Table A: basically dates every 30 days
1/1/2014
2/3/2014
3/3/2014
4/3/2014
I have Table B that has records, with a created date associated with each record
I want all records that fall between the 1st 30 days to have an additional column that indicates 30
union
records with additional column indicating 60 days that fall between the 30 and 60 day
union
records with additional column indicating 90days that fall between the 60 and 90 day mark.
Is there an easy way to do this?
Feb 8, 2007
I am trying to test some data handling between two different versions of an application.
I have restored the database schema twice, once as DB_old and once as DB_new.
I import a transaction using the new application into DB_new and I import the SAME transaction into the DB_old using the old version of application.
I then have to eyeball the data in SQL Query Analyzer to try to identify problems where the fields have received different values.
I have done this by running a select statement twice telling it to use both of the databases and then viewing it in two grids. There are a lot of columns so I have to do a lot of scrolling across the screen to do the comparison, and since the view is in two separate grids I have to hop back and forth and click the scroll bars, etc.
It seems like there has to be a better way. I don't suppose there is a way to lock the two grids so they both scroll together is there?
I was thinking maybe I could insert each of the selects into a temporary table and then do some kind of comparison to identify which values were different in each column. Some of the columns will have differences, like the timestamp, but if I could somehow identify which columns were different then I could eyeball them to identify which of those were okay to be different and which of them were actually bugs from the changed application version.
I have no idea how to identify those individual columns with different data values or even where to start.
Just so you understand better what I am doing now here is the query I am running that I then eyeball:
use DB_new
select * from claim where claim_id = 35144
use DB_old
select * from claim where claim_id = 35144
Thanks for any ideas.
Mar 6, 2014
I have a table that has multiple records as illustrated in the simple list below. The real list is much longer.
MachineA 1/1/2008
MachineA 1/3/2008
MachineB 1/7/2008
MachineB 1/8/2009
MachineB 5/5/2010
MachineA 5/7/2011
MachineA 4/2/2013
I need to query to return a result for each unique machine with the latest date. The example result below would be returned because they have the latest date.
MachineA 5/7/2011
MachineB 5/5/2010
Select Distinct would almost do it, but I need each unique machine that has the latest date.
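If only the machine and its latest date are needed, a plain GROUP BY with MAX is usually enough. The sketch below uses hypothetical names (dbo.MachineLog, MachineName, LogDate) because the post does not give the table or column names.

```sql
-- Sketch: one row per machine, carrying its most recent date.
SELECT  MachineName,
        MAX(LogDate) AS LatestDate
FROM    dbo.MachineLog
GROUP BY MachineName
ORDER BY MachineName;
```

Note that for the sample rows as listed, the latest MachineA date would be 4/2/2013 rather than 5/7/2011. If other columns from the latest row are also needed, ROW_NUMBER() OVER (PARTITION BY MachineName ORDER BY LogDate DESC) filtered to 1 is the usual alternative.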
Apr 14, 2014
Suppose I have 2 tables:
1) Main
2) History
The Main table maintains all the records and has columns MAIN_SKU, DEDUCTIBLE_AMT, model_id, catagory, ModifiedDate.
If DEDUCTIBLE_AMT changes, an entry is placed in the History table; the columns are the same, plus history_id.
I want to display each distinct MAIN_SKU from the History table (all columns) with the last DEDUCTIBLE_AMT change from the History table.
table structure
main table
MAIN_SKU | DEDUCTIBLE_amt | model_id | catagory
1 | 100 | 100 | phone
2 | 150 | 101 | phone
3 | 200 | 109 | smartphone
4 | 100 | 202 | smartphone
History table
History_id | MAIN_SKU | DEDUCTIBLE_amt | model_id | catagory | ModifiedDate
1 | 1 | 150 | 100 | phone | 4/14/2014
2 | 1 | 200 | 101 | phone | 4/13/2014
3 | 4 | 109 | 202 | smartphone | 4/14/2014
4 | 4 | 101 | 202 | smartphone | 4/13/2014
5 | 2 | 200 | 101 | phone | 4/13/2014
6 | 3 | 100 | 109 | smartphone | 4/12/2014
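A sketch of the usual "latest row per key" pattern for this: number the History rows per MAIN_SKU from newest to oldest and keep the first. It assumes the table is literally named History and that a higher History_id is an acceptable tie-breaker when two changes share the same ModifiedDate.

```sql
-- Sketch: rn = 1 marks the most recent History row for each MAIN_SKU.
WITH ranked AS (
    SELECT  History_id, MAIN_SKU, DEDUCTIBLE_amt, model_id, catagory, ModifiedDate,
            ROW_NUMBER() OVER (PARTITION BY MAIN_SKU
                               ORDER BY ModifiedDate DESC, History_id DESC) AS rn
    FROM    History
)
SELECT  History_id, MAIN_SKU, DEDUCTIBLE_amt, model_id, catagory, ModifiedDate
FROM    ranked
WHERE   rn = 1
ORDER BY MAIN_SKU;
```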
Apr 24, 2014
I have a table called TBLCataloghi
I have multiple records in which the columns codpro and codcat are equal.
They differ only by a date called catalog.datfin.
I'd like to select all rows, but for rows with the same codpro and codcat, obtain ONLY the row with the MIN() datfin.
The datfin field is a date.
Ex. codpro = 'PIPPO'
codcat = 'MK'
DATFIN = 01/01/2010
codpro = 'PIPPO'
codcat = 'MK'
DATFIN = 10/07/2014
I'd like to read both records but have the SELECT return only the record with the MIN datfin (01/01/2010).
I wrote the query below but could not get it to work; I always get both records back.
SELECT catalog.codpro AS CodProdotto,
catalog.codcat AS CodiceCatalogo,
MIN(catalog.datfin)
FROM pub.catalog
WHERE catalog.codcat = 'MK'
GROUP BY catalog.codpro,catalog.codcat ,catalog.datfin
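The likely reason both rows come back is that datfin appears in the GROUP BY, so every distinct date forms its own group and MIN() has nothing to collapse. A minimal corrected sketch, keeping the table and column names from the query above and adding only the hypothetical alias PrimaDatFin:

```sql
-- Sketch: group only by the key columns so MIN() can pick
-- the earliest datfin within each codpro/codcat pair.
SELECT  catalog.codpro      AS CodProdotto,
        catalog.codcat      AS CodiceCatalogo,
        MIN(catalog.datfin) AS PrimaDatFin
FROM    pub.catalog
WHERE   catalog.codcat = 'MK'
GROUP BY catalog.codpro, catalog.codcat;
```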
Sep 26, 2014
I have this query below that I created to do a count, but I don't think this is what I needed.
I need to find the duplicates. Example, if
CLI_ID1 12345 has 4 CLIP records, each CLIP record should have a different CLIP rank. I need to find scenarios where 2 (or more) of the CLIP records have the same CLIP RANK. If there are duplicate CLIP_RANKs within the same CLI_ID,
Select Distinct
cli_id1, count(clip_rank) countrank
FROM impact.dbo.CLI
LEFT JOIN impact.dbo.CLIO ON CLI.CLI_ID1 = CLIO.clio_id1
left join
impact.dbo.clip ON cli_id1 = clip_id1
Where (clio_trm = '' or clio_trm = NULL or clio_trm is null)
group by cli_id1
order by cli_id1
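To surface repeated ranks rather than a total count, the rank itself needs to be part of the grouping, with HAVING keeping only the combinations that occur more than once. The sketch below queries the clip table alone and leaves out the CLIO filter from the query above, which would have to be joined back in if it still applies; rank_count is a hypothetical alias.

```sql
-- Sketch: each returned row is a rank that is used more than once
-- within the same clip_id1.
SELECT  clip_id1,
        clip_rank,
        COUNT(*) AS rank_count
FROM    impact.dbo.clip
GROUP BY clip_id1, clip_rank
HAVING  COUNT(*) > 1
ORDER BY clip_id1, clip_rank;
```

As a side note on the original query, the predicate clio_trm = NULL never evaluates to true in T-SQL under default settings; the clio_trm IS NULL test already covers that case.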
May 2, 2014
Using the query below
DELETE FROM Table_name
WHERE Date_column < GETDATE() - 30
I am able to delete records older than 30 days, but I want the results to be saved to a file.
Before deleting, I want to create a file and save the records that are about to be deleted.
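One approach, sketched under assumptions, is to capture the deleted rows with the OUTPUT clause into a holding table and export that table to a file afterwards (for example with the bcp utility). Table_name_deleted is a hypothetical name, and the sketch assumes the source table has no identity column, since SELECT ... INTO would copy the identity property and the OUTPUT ... INTO would then fail.

```sql
-- Sketch: create an empty copy of the table structure once...
SELECT  *
INTO    Table_name_deleted      -- hypothetical holding table
FROM    Table_name
WHERE   1 = 0;

-- ...then delete and capture the removed rows in a single statement.
DELETE FROM Table_name
OUTPUT  deleted.* INTO Table_name_deleted
WHERE   Date_column < DATEADD(DAY, -30, GETDATE());
```

Doing the capture and the delete in one statement avoids the risk of the two steps seeing different sets of rows.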
Jun 5, 2014
I need to export the records of a table in xml format.
create table ##prova
( Valuta varchar(2),
Misura float
)
insert into ##prova values ('EU',1000)
insert into ##prova values ('$',2000)
The final result must be something like this:
<root>
<obs id="0">
<dim name="Valuta" value="EU" />
<dim name="Misura" value="1000" />
</obs>
<obs id="0">
<dim name="Valuta" value="$" />
<dim name="Misura" value="2000" />
</obs>
</root>
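A possible way to get this shape is FOR XML PATH with a nested subquery that turns each column into a dim element. This is a sketch rather than a tested solution: the constant id of 0 simply mirrors the sample output, and the inner VALUES list has to be extended by hand for each additional column.

```sql
-- Sketch: outer query emits one <obs> per row; the inner query emits
-- one <dim name=... value=.../> per column of that row.
SELECT  0 AS [@id],
        ( SELECT d.name AS [@name],
                 d.val  AS [@value]
          FROM   (VALUES ('Valuta', p.Valuta),
                         ('Misura', CAST(p.Misura AS varchar(30)))) AS d (name, val)
          FOR XML PATH('dim'), TYPE )
FROM    ##prova AS p
FOR XML PATH('obs'), ROOT('root');
```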
Sep 4, 2015
I have a scenario where ID has three flags.
For example
ID flag1 flag2 flag3
1 0 1 0
2 1 0 0
1 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
Now I want the records having flag2=1 only, i.e. ID=3 has only flag2=1, whereas ID=1 and ID=2 have flag1 or flag3 = 1 along with flag2=1. I don't want ID=1 and 2.
I can't make ID unique or primary. I tried CASE WHEN statements, but I am somehow missing the basic logic.
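Because the condition spans several rows per ID, it can be expressed with GROUP BY and HAVING instead of a row-level CASE. A minimal sketch, assuming a hypothetical table name dbo.IdFlags; the casts are there in case the flag columns are of type bit, which SUM does not accept directly.

```sql
-- Sketch: keep IDs where flag2 is set on at least one row and
-- flag1 and flag3 are never set on any row for that ID.
SELECT  ID
FROM    dbo.IdFlags
GROUP BY ID
HAVING  SUM(CAST(flag2 AS int)) > 0
   AND  SUM(CAST(flag1 AS int)) = 0
   AND  SUM(CAST(flag3 AS int)) = 0;
```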
Mar 17, 2014
I have a series of records keyed on empid, and I want to identify the empids that have discrepancies. Some empids are listed more than once with different DOBs. In the example I am trying to create a DOB_ERROR column that says 'Yes' if the DOB doesn't match the other records in the file with the same empid, and 'No' otherwise.
SELECT
Empid,
DOB,
CASE WHEN DOB = DOB THEN 'No' ELSE 'Yes' END AS DOB_ERROR,
City,
St,
Gender
FROM Emp
WHERE EMPID IN
('12335', '23456', '545432','231245')
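The expression DOB = DOB compares each row with itself, so it can never report a mismatch. One way to compare a row with its siblings, sketched below, is to look at the minimum and maximum DOB across all rows sharing the same Empid using windowed aggregates; if they differ, at least two different dates exist for that employee.

```sql
-- Sketch: MIN and MAX over the Empid partition differ only when
-- the employee has more than one distinct non-NULL DOB.
SELECT  Empid,
        DOB,
        CASE WHEN MIN(DOB) OVER (PARTITION BY Empid)
               <> MAX(DOB) OVER (PARTITION BY Empid)
             THEN 'Yes' ELSE 'No' END AS DOB_ERROR,
        City,
        St,
        Gender
FROM    Emp
WHERE   Empid IN ('12335', '23456', '545432', '231245');
```

Rows with a NULL DOB are ignored by MIN and MAX, so a missing date on one record would not be flagged by this version.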
Jun 9, 2014
I have 2 tables in a 1:n relation. How can I write a SELECT statement that outputs the values from the n side as a single field, separated by semicolons? Example: one person has many job titles.
Table1 (tblPerson)
Table2 (tblTitles)
1, "John", "Miller", "Employee; Admin; Consultant"
2, "Joan", "Stevens", "Employee, Software Engineer, Consultant"
and so on .... 1 in select statement:
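On SQL Server versions without a built-in string aggregate, a widely used workaround is a correlated FOR XML PATH('') subquery wrapped in STUFF to trim the leading separator. The column names below (PersonID, FirstName, LastName, Title) are assumptions, since the post only names the two tables.

```sql
-- Sketch: the subquery concatenates '; ' + Title for each person,
-- and STUFF removes the first two characters (the leading '; ').
SELECT  p.PersonID,
        p.FirstName,
        p.LastName,
        STUFF(( SELECT '; ' + t.Title
                FROM   tblTitles AS t
                WHERE  t.PersonID = p.PersonID
                ORDER BY t.Title
                FOR XML PATH(''), TYPE ).value('.', 'varchar(max)'),
              1, 2, '') AS Titles
FROM    tblPerson AS p;
```

Going through TYPE and .value() keeps characters such as & and < in titles from being returned as XML entities.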
Aug 6, 2015
I have a table on which I need to do some computations across all the data, but first I need to remove the duplicate records and insert the results into a destination table. Here's the example below. My table has 3.1 million rows. I have tried using DISTINCT and GROUP BY, but either way of selecting the data takes about half a minute to run. I'm wondering if there is a way to increase performance. Users are OK with this time since the process runs overnight, but improving it won't hurt. I do have a clustered index on these fields, but that doesn't seem to improve anything.
SELECT DateYear ,
DateMonth ,
Nbr ,
Nbr1 ,
Nbr2 ,
Datafield1 ,
Datafield2,
[code].....
Apr 14, 2015
I have around 3 tables having around 20 to 30gb of data. My table A related to table B by a FK and same way table B related to table C by FK. I would like to delete all rows satisfying certain condition from table A and all corresponding related records from table B and C. I have created a query to delete the grandchild first, followed by child table and finally parent. I have used inner join in my delete query. As you all know, inner join delete operations, are going to be extremely resource Intensive especially on bigger tables.
What is the best approach to delete all these rows? There are many constraints, triggers on these tables. Also, there might be some FK relations to other tables as well.
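A frequently suggested way to keep locking and log growth manageable on tables this size is to delete in small batches, grandchild first. The sketch below shows the loop for table C only; all table, key, and filter names (dbo.A, dbo.B, dbo.C, AId, BId, SomeFlag) are hypothetical, and the batch size is a guess to be tuned.

```sql
-- Sketch: remove grandchild rows in batches until none are left.
-- The same loop would then be repeated for B, and finally for A.
DECLARE @batch int = 5000;   -- assumed batch size

WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) c
    FROM    dbo.C AS c
    JOIN    dbo.B AS b ON b.BId = c.BId
    JOIN    dbo.A AS a ON a.AId = b.AId
    WHERE   a.SomeFlag = 1;          -- stands in for the real condition

    IF @@ROWCOUNT = 0 BREAK;
END;
```

Triggers and foreign keys to other tables still fire on every batch, so their cost needs to be checked separately before running this against production data.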