T-SQL (SS2K8) :: Ranking Duplicate Contacts Using Multiple Columns
Jun 8, 2015
I'm in the process of trying to identify duplicate contacts. I doing this for millions of contacts and have gotten stuck and could use some elegant solutions!
The business rule is this:
Any contact that has the same name, phone and email address are the same contact
Any contact that has the same name, and email address are the same contact
Any contact that has the same name, email address, but different phone are a different contact.
Any contact that has the same name, email address, and a blank phone can be the same contact as one that has the same name, email address, and has an email address
Rank by the DataSource_fk. 1 being the highest
Put another way:
If 3 contacts have the same name, 2 have phone '1112223344' and all three have the email address 'johndoe@gmail.com' they are the same contact and the lowest DataSource_fk should be ranked the highest.
I've used the Row_number over (Partition by) in the past, but am unsure how to deal with the blanks in email and phone.
DROP TABLE [dbo].[TestBusinessContact];
GO
CREATE TABLE [dbo].[TestBusinessContact]
(
[TestBusinessContact_pk] INT IDENTITY(1,1)NOT NULL,
[Business_fk]INT NOT NULL CONSTRAINT DF_TestBusinessContact_Business_fk DEFAULT(0),
I'm new to SQL with 2 weeks under my belt....lol, so this may be a simple edit:
When I run the following query, I can get a list of all dups in the contact field: ++++++++++++++++++++++++++++++++++ SELECT full_name, COUNT(full_name) AS NumOccurrences FROM contact GROUP BY full_name HAVING ( COUNT(full_name) > 1 ) ++++++++++++++++++++++++++++++++++ However: I need to make sure I am de-activating (active = 0) only the contacts where they are listed more then once within the same company table (company.company_id) and the condition is that phone is NULL. I can't seem to make it work. Does anyone have any suggestions for an UPDATE I can use?
I'd like to first figure out the count of how many rows are not the Current Edition have the following:
Second I'd like to be able to select the primary key of all the rows involved
Third I'd like to select all the primary keys of just the rows not in the current edition
Not really sure how to describe this without making a dataset
CREATE TABLE [Project].[TestTable1]( [TestTable1_pk] [int] IDENTITY(1,1) NOT NULL, [Source_ID] [int] NOT NULL, [Edition_fk] [int] NOT NULL, [Key1_fk] [int] NOT NULL, [Key2_fk] [int] NOT NULL,
[Code] .....
Group by fails me because I only want the groups where the Edition_fk don't match...
I am facing a problem in writing the stored procedure for multiple search criteria.
I am trying to write the query in the Procedure as follows
Select * from Car where Price=@Price1 or Price=@price2 or Price=@price=3 and where Manufacture=@Manufacture1 or Manufacture=@Manufacture2 or Manufacture=@Manufacture3 and where Model=@Model1 or Model=@Model2 or Model=@Model3 and where City=@City1 or City=@City2 or City=@City3
I am Not sure of the query but am trying to get the list of cars that are to be filtered based on the user input.
I have a question regarding duplicate records, the thing is I'm able to query for duplicated records if I type the following:
select ColumnName from TableName where ColumnName in ( select ColumnName from TableName group by ColumnName having count(*) > 1 )
That gives me duplicate records for one column, but I need find duplicate records in more than one column (4 columns to be exact), but the way I need to find these records is they all have to be duplicate, what I'm trying to say is I don't don't want to find the following:
First Last Age Email John Smith 25 jsmith@hotmail.com John Smith 26 jsmith@hotmail.com John Smith 25 jsmith4@hotmail.com
I need to find the following:
First Last Age Email John Smith 25 jsmith@hotmail.com John Smith 25 jsmith@hotmail.com John Smith 25 jsmith@hotmail.com
So all the columns must be exactly the same, that's the only condition I want to show the records, is there any way to do this?
For the record, I'm using MS SQL Server 2000, thank you.
Can you update data from multiple tables in the same UPDATE statement, by joining those tables in a CTE ?
For example, this fails:
DECLARE @UPDCATE_COUNT AS int = 100000; WITH COMBINED_TABLES AS ( SELECT TOP (@UPDATE_COUNT) T.UpdateID, T.IS_UPDATED, U.[Description] FROM dbo.Table1 AS U INNER JOIN dbo.Table2 AS T
I have a case where if the Id field is a specific value, I don't want to allow null in another field, but if the Id value <> a specific value, null is ok.
In the example below, inserting the first record should succeed, the second should succeed, and the 3rd should fail. Right now the 2nd two fail. I gotta be missing something easy, but I can't figure it out.
USE tempdb GO IF OBJECT_ID('tempdb.dbo.CheckConstraintTest') IS NOT NULL DROP TABLE tempdb.dbo.CheckConstraintTest; CREATE TABLE CheckConstraintTest
SELECT 'Type'[Type] ,CASE WHEN code='09' THEN SUM(Amt/100) ELSE 0 END ,CASE WHEN code='10' THEN SUM(Amt/100) ELSE 0 END ,CASE WHEN code='11' THEN SUM(Amt/100) ELSE 0 END ,CASE WHEN code='12' THEN SUM(Amt/100) ELSE 0 END FROM Table1 WHERE (Code BETWEEN '09' AND '12') GROUP BY Code
and the output
Column 1 Column 2 Column 3 Column 4 Type 14022731.60 0.00 0.00 0.00 Type 0.00 4749072.19 0.00 0.00 Type 0.00 0.00 149214.04 0.00 Type 0.00 0.00 0.00 792210.10
How can I modify the query to come up with output below,
I have multiple databases in the server and all my databases have tables: stdVersions, stdChangeLog. The stdVersions table have field called DatabaseVersion which stored the version of the database. The stdChangeLog table have a field called ChangedOn which stored the date of any change made in the database.
I need to write a query/stored procedure/function that will return all the database names, version and the date changed on. The results should look something like this:
I am trying to find a way to add into a table a flattened (comma seperated list) of email addresses based on the multiple columns of nformation in another table (joined by customer_full_name and postcode.
This is to highlight duplicate email addresses for people under the same customer_full_name and Postcode.
I have done this using a loop which loops through concatenating the email addresses but it takes 1minute to do 1000. The table is 19,000 so this isn't really acceptable. I have tried temp tables, table variables and none of this seems to make any difference. I think that it is becuase i am joining on text columns?
I have a table with score info for each group, and the table also contains historical data, I need to get the ranking for the current week and previous week, here is what I did and the result is apparently wrong:
select CurRank = row_number() OVER (ORDER BY cr.CurScore desc) , cr.group_name,cr.CurScore , lastWeek.PreRank, lastWeek.group_name,lastWeek.PreScore from (select group_name, Avg(case when datediff(day, asAtDate, getdate()) <= 7 then sumscore else 0 end) as CurScore
[Code] ....
The query consists two parts: from current week and previous week respectively. Each part returns correct result, the final merged result is wrong.
Basically, I'm given a daily schedule on two separate rows for shift 1 and shift 2 for the same employee, I'm trying to align both shifts in one row as shown below in 'My desired results' section.
Sample Data:
;WITH SampleData ([ColumnA], [ColumnB], [ColumnC], [ColumnD]) AS ( SELECT 5060,'04/30/2015','05:30', '08:30' UNION ALL SELECT 5060, '04/30/2015','13:30', '15:30' UNION ALL SELECT 5060,'05/02/2015','05:30', '08:30' UNION ALL SELECT 5060, '05/02/2015','13:30', '15:30'
I've begun to get the above error from my package. The error message refers to two output columns.
Anyone know how this could happen from within the Visual Studio 2005 UI? I've seen the other posts on this subject, and they all seemed to be creating the packages in code.
Is there any way to see all of the columns in the data flow? Or is there any other way to find out which columns it's referring to? Thanks!
I found some duplicate data as I was going thru the logic of a data pump. The entire row is not duplicated however.I would like to delete only the one row.
This is a sample of the data: DECLARE @SomeData TABLE ( FirstName varchar(25) , MiddleName varchar(25) , LastName varchar(25) , StreetAddress varchar(25) , Suite varchar(25) , City varchar(25) , [State] varchar(25) , PostalCode varchar(10)
[code]...
As you can see, Joe Smith has two rows, but only one of the rows is complete. I would like to delete only the row that has a NULL value in the phone and area code for Joe Smith. There are a few thousand rows that are like this. They have duplicates all but the area code and phone number.I am used to using a CTE to remove duplicates, but I am a little lost on this one. The things that I have tried, have not worked exactly as I planned.
We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.
In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.
In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.
We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.
alter table mytbl drop column last_login_date select * from mytbl order by employee_id, effective_date
Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):
Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).
My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system, and taking the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.
As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this.
any useful SQL Queries that might be used to identify lists of potential duplicate records in a table?
For example I have Client Database that includes a table dbo.Clients. This table contains various columns which could be used to identify possible duplicate records, such as Surname | Forenames | DateOfBirth | NINumber | PostalCode etc. . The data contained in these columns is not always exactly the same due to differences caused by user data entry; so some records may have missing data from some of the columns and there could be spelling differences too. Like the following examples:
1 | Smith | John Raymond | NULL | NI990946B | SW12 8TQ 2 | Smith | John | 06/03/1967 | NULL | SW12 8TQ 3 | Smith | Jon Raymond | 06/03/1967 | NI 99 09 46 B | SW12 8TQ
The problem is that whilst it is easy for a human being to review these 3 entries and conclude that they are most likely the same Client entered in to the database 3 times; I cannot find a reliable way of identifying them using a SQL Query.
I've considered using some sort of concatenation to a new column, minus white space and then using a "WHERE column_name LIKE pattern" query, but so far I can't get anything to work well enough. Fuzzy Logic maybe?
the results would produce a grid something like this for the example above:
ID | Surname | Forenames | DuplicateID | DupSurname | DupForenames 1 | Smith | John Raymond | 2 | Smith | John 1 | Smith | John Raymond | 3 | Smith | Jon Raymond 9 | Brown | Peter David | 343 | Brown | Pete D next batch of duplicates etc etc . . . .
Im trying to delete duplicate records from the output of the query below, if they also meet certain conditions ie 'different address type' then I would merge the records. From the following query how do I go about achieving one and/or the other from either the output, or as an extension of the query itself?
Hello, I have a table with say 45 columns. I have a business requirement that requires me to fetch the rows for which col1 , col2, ....col 11 are same and rest can be different. there is an identity column, in the table so I can have duplicate rows also.
how can I effectively write a query that will fetch me all those rows for which my 11 columns are same.
I have a business need to create a report by query data from a MS SQL 2008 database and display the result to the users on a web page. The report initially has 6 columns of data and 2 out of 6 have JSON data so the users request to have those 2 JSON columns parse into 15 additional columns (first JSON column has 8 key/value pairs and the second JSON column has 7 key/value pairs). Here what I have done so far:
I found a table value function (fnSplitJson2) from this link [URL]. Using this function I can parse a column of JSON data into a table. So when I use the function above against the first column (with JSON data) in my query (with CROSS APPLY) I got the right data back the but I got 8 additional rows of each of the row in my table. The reason for this side effect is because the function returned a table of 8 row (8 key/value pairs) for each json string data that it parsed.
1. First question: How do I modify my current query (see below) so that for each row in my table i got back one row with 19 columns.
SELECT A.ITEM1,A.ITEM2,A.ITEM3,A.ITEM4, B.* FROM PRODUCT A CROSS APPLY fnSplitJson2(A.ITEM5,NULL) B
If updated my query (see below) and call the function twice within the CROSS APPLY clause I got this error: "The multi-part identifier "A.ITEM6" could be be bound.
2. My second question: How to i get around this error?
SELECT A.ITEM1,A.ITEM2,A.ITEM3,A.ITEM4, B.*, C.* FROM PRODUCT A CROSS APPLY fnSplitJson2(A.ITEM5,NULL) B, fnSplitJson2(A.ITEM6,NULL) C
I am using Microsoft SQL Server 2008 R2 version. Windows 7 desktop.
Hello, I have a table T1 and on this table I have an insert trigger. In the trigger I need to check if T1.ID and T1.Type=’xyz’ together are duplicated or not (duplicate dhcek on two columns), if yes return error from trigger. There might be T1.ID and T1.Type=’abc’ duplicated, that is fine.
I have to write a query which extracts everyone from a table who has the same surname and forenames as someone else but different id's.
The query should have a surname column, a forenames column, and two id columns (from the person column of the table).
I need to avoid duplicates i.e. the first table id should only be returned in the first id column and not in the second - which is what i am getting at the mo.
This is what i have done
select first.surname, first.forenames, first.person, second.person from shared.people first, shared.people second where first.surname= second.surname and first.forenames = second.forenames and not first.person = second.person order by first.surname, first.forenames
and i get results like this
Porter Sarah Victoria 9518823 9869770 Porter Sarah Victoria 9869770 9518823 - i.e. duplicates
When I add this code in a view and try to save . . .SELECT TOP 610 *FROM dbo.Master INNER JOINdbo.TypeByCase ON dbo.TypeByCase.CaseNum = dbo.Master.CaseNumIt gives the error:ODBC error: [Microsoft][ODBC SQL Server Driver]Column names in eachview or function must be unique. Column name 'CaseNum' in view orfunction 'dbo.BobView1' is specified more than once.Any idea why?Thanks,RBollinger
Hi there,I would like to know how to get rows with duplicate values in certaincolumns. Let's say I have a table called "Songs" with the followingcolumns:artistalbumtitlegenretrackNow I would like to show the duplicate songs to the user. I considersongs that have the same artist and the same title to be the same song.Note: All columns do not have to be the same.How would I accomplish that with SQL in SQL Server?Thanks to everyone reading this. I hope somebody has an answer. I'vealready searched the whole newsgroups, but couldn't find the solution.
Hello,Suppose I have the following table...name employeeId email--------------------------------------------Tom 12345 Join Bytes!Hary 54321Hary 54321 Join Bytes!I only want unique employeeIds return. If I use Distinct it will stillreturn all of the above as the email is different/missing. Is there away to query in SQL so that only distinct employeeId is returned? noduplicates.I wouuld like to say WHERE no blank fields are present to get theright row to return.Many thanksYas