Transact SQL :: Comparing Records - Finding Matches / Duplicates
Nov 20, 2015
I have this 40,000,000 rows table... I am trying to clean this 'Contacts' table since I know there are a lot of duplicates.
At first, I wanted to get a count of how many there are.
I need to compare records where these fields are matched:
MATCHED: (email, firstname) but not MATCH: (lastname, phone, mobile).
MATCHED: (email, firstname, mobile)
But not MATCH: (lastname, phone)
MATCHED: (email, firstname, lastname)
But not MATCH: (phone, mobile)
I need to compare records between two tables. There is no ID in the tables to do a simple join between them. So, what I'm looking for is: get the first record from table1 and read all record from table2 and give me back the most similar record. The String Distance is a predefined function.
Select a.table1 ,b.table2 from table1 a, table2 b where StringDistance (''a.table1,'b.table2') >90
I have a scenario to compare previous records based on each ID columns. For each ID, there would be few records, I have a column called "compare", We have to compare all Compare 1 records with Compare 0 Records. If Dt is lesser or equal to comparing DT, then show 0. Else 1
We always only one Compare 0 records in my table, so all compare 1 columns will compare with only one row per ID
My tables look like
Declare @tab1 table (ID Varchar(3), Dt Date, Compare Int) Insert Into @tab1 values ('101','2015-07-01',0) Insert Into @tab1 values ('101','2015-07-02',1) Insert Into @tab1 values ('101','2015-07-03',1) Insert Into @tab1 values ('101','2015-07-01',1) Insert Into @tab1 values ('101','2015-06-30',1)
Insert Into @tab1 values ('102','2015-07-01',0) Insert Into @tab1 values ('102','2015-07-02',1) Insert Into @tab1 values ('102','2015-07-01',1)
select * from @tab1
1.) In the above scenario for ID = '101', we have 5 records, first record has Compare value 0, which mean all other 4 records need to compare with this record only
2.) If Compare 1 record's Dt is less or equal to Compare 0's DT, then show 0 in next column
3.) If Compare 1 record's Dt is greater than Compare 0's DT, then show 1 in next column
I have a problem where I have 2 compare 2 records from the same table. This part looks easy but the problem is for a User there can be multiple records and I have 2 compare each record with its previous instance based on the timestamp. Not only I have to compare I have to perform some analysis. Below is the Table script and sample output.
Givens: All SQL Server 2008 or 2012 tools at your disposal.
Production database contains the following tables (simplified for example: constraints ignored, etc.) associated with a racing video game’s server.
-- A player of our game
-- Table greater than 10 million rows
CREATE TABLE [dbo].[User] ( [UserId] [bigint] NOT NULL ,[country] [int] NULL -- User’s home country ,[name] [nvarchar](15) NULL -- User’s displayable name (‘John’, ‘Bill’) ,[subscriptionTier] [int] NULL ) -- 0 == free, 1 == paid, for instance
Assume that rows get written into the event tables at a rate of 1,000 a minute,are never updated once written and currently are only read on a replica/reporting server.
Question Background: Write up a single query that would return the following: List of users and whose “TotalMoneyEarned” value ever grew (between logon events) at a rate of more than 1,000 per minute (we’d consider these suspicious and flag them for later investigation).
For instance, if the sample data were:
-- example of [Events.UserLogon] data -- not the query output we want
Event 1 is okay because there’s nothing to compare it against
Event 2 is okay because the TotalMoneyEarned only grew 500 in a minute
Event 3 should be flagged, as the value grew 1500 in a minute
Event 4 is okay, as it grew 7,000 in 8 minutes (< 1000 per minute)
Query Output (your query should return data in a format like this):
User Flagged Logon Time Rate Since Last Logon (money/minute) John 2010-10-16 00:21:56 1500 Dave 2010-10-16 00:30:50 3200 Bill 2010-10-16 00:35:23 1000
It is likely that you will need to create sample data for both the User and [Events.Logon] tables. We are looking for a single query that returns data like what is represented in Query Output.
How do I only select rows with duplicate dates for each person (id)? (The actual table has approximately 13000 rows with approximately 3000 unique ids)
If it possible to have an if statement match multiple results, as to not have to use the OR multiple times.
Example: I want to say, if Description equals red or blue or green or yellow or orange or black or white or pink, without having to use OR and OR and OR. Description can match 10 different values.
I have a customer database with the following structure:
Dept ID (int) Section (varchar)
I need to find only occurrences of a section (eg Admin) where the section name has a record in Dept 1,2 and 3 - only return the result if the record for Admin is associated will these depts.
I am trying to complete an insert from query but the problem is I have duplicates, so I'm getting an error message. So to correct it I am creating a Find Duplicates statement in the Query analyzer but Its not working can someone tell me whats wrong with this statement (by the way I'm in SQL 2000 Server)
thank you
SELECT EmployeeGamingLicense [TM#]AS [TM# Field], Count([TM#])AS NumberOfDups FROM TERMINATION GROUP BY [TM#] HAVING Count([TM#])>1; GO
I was just wondering if any one knows how to find duplicate keys using more than one field. I used the below key to find those people who exists in list1 but don't exists in list2. I realized that the results had some duplicates which was expected but how do I then find all those duplicate people. I know how to do it if there was a primary key present I would have done a count (distinct cardnumber) > 1 and i would have done the select statement like this distinct cardnumber, but how do I do it with more that one key??
I have a company table and I would like to write a query that will return tome any duplicate companies. However, it is a little more complicated thenjust matching on exact company names. I would like it to give me duplicateswhere x number of letters at the beginning of the company name match AND xnumber of letters of the address match AND x number of letters of the citymatch. I will be doing this in batches based on the first letter of thecompany name. So for example I will first process all companies that startwith the letter "A".So for all "A" companies I want to find companies where the first 5 lettersin the company name match and the first 5 characters of the address fieldmatch and the first 5 characters of the city match. THANKS!!!
I'm trying to find duplicates. Any help will be greatly appreciated! Here's my query:
I'm trying to find only the records where the "Sales_Header_Your_Reference" field is used more then once with a different "Sales Header Document ID" So it looks like this Sales Header Document ID Sales_Header_Your_Reference 1955718 0002377729 2082721 0002478945 2082728 0002478976 2093598 0002487318 2093599 0002487318 2093601 0002487332 I hope im clear enough, here is my query. I get then listed right, but the count is all wrong, I only want to show the ones , like the one i have highlighted.
Code Block
Select COUNT(Sales_Header_Your_Reference),Tbl_Sales_Header.Sales_Header_Document_Id,Tbl_Sales_Header.Sales_Header_Your_Reference FROM dbo.Tbl_Sales_Header INNER JOIN dbo.Tbl_Sales_Invoice_Header ON dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = dbo.Tbl_Sales_Invoice_Header.Sales_Invoice_Header_Bill_Customer_No AND dbo.Tbl_Sales_Header.Sales_Header_Order_DateTime = dbo.Tbl_Sales_Invoice_Header.Sales_Invoice_Header_Order_Datetime WHERE (dbo.Tbl_Sales_Header.Sales_Header_Order_DateTime > CONVERT(DATETIME, '2007-09-15 00:00:00', 102)) AND (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'Butler') OR (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'MWI') OR (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'NLS') OR (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'HSI') OR (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'NLSMVD') OR (dbo.Tbl_Sales_Header.Customer_Bill_Customer_No = 'RNEXT') GROUP by Tbl_Sales_Header.Sales_Header_Document_Id,Tbl_Sales_Header.Sales_Header_Your_Reference HAVING COUNT(Sales_Header_Your_Reference) > 1 order by Tbl_Sales_Header.Sales_Header_Document_Id, Tbl_Sales_Header.Sales_Header_Your_Reference
Problem is that if the [Receiving] table doesn't have a match then no records are return. I want all matches from the [Orders Subtable] and any matches from the [Receiving] Table. If no [Receiving] table matches then I still want all matches from the [Orders Subtable]. Attached is the query.
Note: The query has to run in Access 2000 and I will be coding it in VB.
FROM (Orders LEFT JOIN Receiving ON Orders.[Orders ID] = Receiving.[Orders ID]) INNER JOIN [Orders Subtable] ON Orders.[Orders ID] = [Orders Subtable].[Orders ID]
GROUP BY Orders.[Orders ID], [Orders Subtable].ID, [Orders Subtable].Quantity, Receiving.Quantity, Orders.[Project #], [Orders Subtable].On_Order, [Orders Subtable].[Component #], Receiving.[Component #]
HAVING (((Orders.[Project #])="Speed1aaaaa") AND (([Orders Subtable].On_Order)=True) AND (([Orders Subtable].[Component #])="R02101A") AND ((Receiving.[Component #])="R02101A"));
I am trying to move distinct data from one table to another based on two columns regardless of special characters such as : ( ) ; ' " / - _ = >< etc. The two columns that are the defining factors that I need to match as duplicates are Col1 & Col3, the rest I want to get the maximum data from each column for that row (I hope this made sense).
Here is what I have tried, but it does not seem to work:
Code:
select ID, Col1, max(Col2)as Col2, Col3, max(Col4)as Col4, max(Col5)as Col5, max(Col6)as Col6, max(Col7)as Col7, max(Col8)as Col8, max(Col9)as Col9 where patindex('%[^-:]%',Col1) = 0 AND patindex('%[^-:]%',Col3) = 0 into Newtable from dbo.CLEANING_bk group by Col1,Col3
Basically if there are 10 rows of duplicate data based on Col1 & Col3 (regardless of special characters), I want to move the distinct info to a new table with the maximum amount of info from the other columns of the duplicate rows.
If row 3, 4,5, & 6 are all considered duplicate and Row3/Col2 has info in it that row 4 & 5/Col2 does not, I want to combine it with the single row I export to the new table retaining as much information from the rows of the duplicates as possible.
Can anyone help and let me know why my script is not wanting to play nice?
I have this project; I am trying to work on but cant because of duplicated data in that table.
I tried to make the culume not accept duplicates, but I need to clean it up first and I can just delete them.
Currently the values are 11 digit numbers and I need to replace them with 00000000001
incrementing, making my last one 00000003000
Any idea on where to even begin? I am new to this.
What I am trying to do.
Find replicates numbers in TRAN_DATA
Replace each with a new number but increment the new number starting with 00000000001 and ending with 00000003000
With this I knew I had 3000 plus. ELECT TRAN_LCTR_NR, COUNT(TRAN_LCTR_NR) AS NumOccurrences FROM TRAN_DATA GROUP BY TRAN_LCTR_NR HAVING ( COUNT(TRAN_LCTR_NR) > 1 )
Planning - contains a list of planned items. Used to define boundaries for a work day and defines based on type what can be done for each item.
Id, TypeId - the type of the planned items BeginTime DateTime - begin date and time of the planned item EndTime DateTime - end date and time for the planned item
In the Planning table we can have as many records per day as we need:
1, First Meeting, 1 Jan 2008 09:00, 1 Jan 2008 11:00 2, First Meeting, 1 Jan 2008 11:00, 1 Jan 2008 12:00 3, First Meeting, 1 Jan 2008 13:00, 1 Jan 2008 15:00 4, First Meeting, 1 Jan 2008 15:00, 1 Jan 2008 18:00
Appointments - contanis a list with appointments
Id, BeginTime DateTime EndTime DateTime
1, 1 Jan 2008 09:00, 1 Jan 2008 09:30 2, 1 Jan 2008 10:00, 1 Jan 2008 11:00 3, 1 Jan 2008 11:00, 1 Jan 2008 11:30 4, 1 Jan 2008 14:00, 1 Jan 2008 15:30
What is needed?
What I need is to a find a way to compare the planned items with the appointments and to return all the periods for which a planned time exists:
Free planned time:
1, 1 Jan 2008 09:30, 1 Jan 2008 10:00 2, 1 Jan 2008 11:30, 1 Jan 2008 12:00 3, 1 Jan 2008 13:00, 1 Jan 2008 14:00 4, 1 Jan 2008 15:30, 1 Jan 2008 18:00
So, having two multitudes of periods,where the one specifies the planning templates and the other real used time, I need to find all the periods which can be used for another appointments.
I've tried several aproaches, but I always faced performance problems.
Hi all, I have two tables - Planning and Appointments:
Planning - contains a list of planned items. Used to define boundaries for a work day and defines based on type what can be done for each item. Id, TypeId - the type of the planned items
BeginTime DateTime - begin date and time of the planned item EndTime DateTime - end date and time for the planned item
In the Planning table we can have as many records per day as we need:
1, First Meeting, 1 Jan 2008 09:00, 1 Jan 2008 11:00 2, First Meeting, 1 Jan 2008 11:00, 1 Jan 2008 12:00 3, First Meeting, 1 Jan 2008 13:00, 1 Jan 2008 15:00 4, First Meeting, 1 Jan 2008 15:00, 1 Jan 2008 18:00
Appointments - contanis a list with appointments Id,
BeginTime DateTime EndTime DateTime
1, 1 Jan 2008 09:00, 1 Jan 2008 09:30 2, 1 Jan 2008 10:00, 1 Jan 2008 11:00 3, 1 Jan 2008 11:00, 1 Jan 2008 11:30 4, 1 Jan 2008 14:00, 1 Jan 2008 15:30
What is needed? What I need is to a find a way to compare the planned items with the appointments and to return all the periods for which a planned time exists:
Free planned time: 1, 1 Jan 2008 09:30, 1 Jan 2008 10:002, 1 Jan 2008 11:30, 1 Jan 2008 12:00 3, 1 Jan 2008 13:00, 1 Jan 2008 14:00 4, 1 Jan 2008 15:30, 1 Jan 2008 18:00
So, having two multitudes of periods,where the one specifies the planning templates and the other real used time, I need to find all the periods which can be used for another appointments. I've tried several aproaches, but I always faced performance problems.
What is the best way to compare two entries in a single table wherethe two fields are "almost" the same?For example, I would like to write a query that would compare thefirst two words in a "company" field. If they are the same, I wouldlike to output them.For example, "20th Century" and "20th Century Fox" in the companyfield would be the same.How do I do this? Do I need to use a cursor? Is it as simple as using"Like?"
I am trying to compare two flat files and extract new entry into new file.But in my case there is no key column in both flat files. is any way to find the new entry by checksum with out Key matching?.
There are many duplicate records on my data table because users constantly register under two accounts. I have a query that identify the records that have a duplicate, but it only shows one of the two records, and I need to show the two records so that I can reconcile the differences.The query is taken from a post on stack overflow. It gives me 196, but I need to see the 392 records.
How to identify the duplicates and show the tow records without having to hard code any values, so I can use the query in a report, and anytime there are new duplicates, the report shows them.
Hey Guys, I have a contacts table that contains ID, First Name, Last Name, and Phone Number, Date Entered, Changed. Every time, the data is modified and saved, it will insert a new record in the table. So, Ill create a new record for a contact named Ryan, and then come back a day later and update the last name and phone number. So theSQL table would look like...1 Ryan Scott 818-550-0000 05/08/2008 Null2 Ryan Peters 000-000-0000 05/09/2008 Null How do I write a sql query that will run an update after the insert of the second record to fill in the Changed field with the data that changed?So I want to have record 2, end up looking like this... 2 Ryan Peters 000-000-0000 05/09/2008 LastName,PhoneNumberAny ideas?
Thanks for your help... I have two databases in two different servers, I am running this script which shows customers not in the second server. I am getting an error shown below. any idea of how to solve this issue. Ali
CREATE view v_show_customers_not_in_GP as select customer_id,company_name,contact_fname,contact_lna me,phone,alt_phone,fax,email,street_1,street_2,cit y,c.name,s.code as state,zip_code FROM customer v,country c, state s WHERE v.country_id =c.country_id and v.state_id = s.state_id and convert(char(15),customer_id ) NOT IN (select custnmbr from servername.dbname.dbo.RM00101 )
Server: Msg 18452, Level 14, State 1, Line 1 Login failed for user '(null)'. Reason: Not associated with a trusted SQL Server connection.
I have a date comparison situation in which I will have a column with a date and will have another value containing a GETDATE() minus two weeks. I want to be able to compare both dates and get the MIN date between the two. I'm not sure how to use the MIN(Date) in this scenario since the comparison won't be between two different columns, but between one column and a random date generated by the GETDATE() minus two weeks.
Serial Count 001 2 the count is 2 because Serial 001 has an MSDSID of 20 and 22 002 1 the count is 1 because Serial 002 only has MSDSID 21 003 2 the count is 2 because Serial 003 has an MSDSID of 21 and 22 004 1 the count is 1 because Serial 002 only has MSDSID 23
It would be even better if the results just showed where the count is greater than 1.
I'm trying to pull records from a source/staging table where there is a duplicate row in it.I don't need that as the requirement is to garbage in /garbage out.when I do that from mart and use joins btw fact and dimensions, Im not getting this duplicate record as Im using distinct/group by. If I removed it, then it returns more than 3000 rows which is not correct. Is there a way I can keep these duplicates without removing group by...Im using correct joins and filters.
I am doing some audit and i have below query, how can i get rid of duplicates from the below query any T SQL to get rid of duplicates...
I am using SP_Who2 and sql server Audit for auditing all data happening on sql server databases and dumping them to tables Audit_DBAudit abd Audit_sp_who2 and from then i am trying to get data which is not repeating/duplicate ...
SELECT A.ProgramName ,a.HostName,[Server_principal_name],[Server_instance_name],[Database_name],[Object_name],F.Statement FROM Audit_DBAudit as F Join [Audit_sp_who2] AS a on LTRIM(RTRIM(F.server_principal_name))=LTRIM(RTRIM(A.Login))