T-SQL (SS2K8) :: Multiple Contact / Possibilities - How To Remove Duplicates
Aug 11, 2015
I have a bunch of contacts that I've scored on how well their names match other contacts in the same business. I can figure out how to parse the results programmatically, but would like to know how to do this via SQL. My problem: for Business_fk 968976 I have 7 contacts, and in the end I should have 4 contacts based on name match. For the business key listed, Gerardo Lopez is in the ContactScore table twice, for Contact keys 7355719 and 57028145. I then have two rows like so:
The two rows reference each other. Two rows is the easy case; a more difficult case would have key 1 listed 10 times showing a ContactMatch_fk of 2 - 11, and then Contact_fk 2 listed 10 times with a ContactMatch_fk of 1, 3-11. I know 57028145 maps to 7355719 from the first row in the ContactScore table, so when Contact_fk of 7355719 comes up I should be able to skip it and not process that match. Hopefully that makes sense. Anyway, here is the test data:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ContactScore]') AND type in (N'U'))
DROP TABLE [dbo].[ContactScore];
GO
CREATE TABLE [dbo].[ContactScore]
(
[ContactScore_pk]INT NOT NULL,
[Contact_fk]INT NOT NULL,
;WITH ctePreAgg AS (
    select top 500
        act_reference "ActivityRef",
        row_number() over (partition by act_reference order by act_reference) as rowno,
        t3.s_initials "Initials"
    from mytablestuff
    order by act_reference
[code]...
But what I would love to do next is take each of the above rows and return the initials in one column, separated by commas, with all the NULLs and duplicate values removed,
OR the same thing using a variable number of columns based on the maximum number of different initials for each row. This is not strictly required, but may be neater for further work on the view.
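For the single-column variant, a common approach on SQL Server 2008 is the FOR XML PATH('') idiom wrapped in STUFF. The following is only a minimal sketch: the table variable @Activity and its rows are hypothetical stand-ins for the real source, and only the column names ActivityRef and Initials follow the query above.

```sql
-- Hypothetical sample data standing in for the real activity rows.
DECLARE @Activity TABLE (ActivityRef INT NOT NULL, Initials VARCHAR(10) NULL);

INSERT INTO @Activity (ActivityRef, Initials)
VALUES (1, 'AB'), (1, 'AB'), (1, NULL), (1, 'CD'), (2, 'EF'), (2, NULL);

-- One row per ActivityRef. NULL initials are filtered out, DISTINCT drops
-- repeated values, and STUFF strips the leading comma from the list.
SELECT a.ActivityRef,
       STUFF((SELECT DISTINCT ',' + i.Initials
              FROM @Activity AS i
              WHERE i.ActivityRef = a.ActivityRef
                AND i.Initials IS NOT NULL
              FOR XML PATH('')), 1, 1, '') AS InitialsList
FROM (SELECT DISTINCT ActivityRef FROM @Activity) AS a
ORDER BY a.ActivityRef;
```

One caveat with this idiom: FOR XML entity-encodes characters such as & and <, which should not matter for plain initials but would for free text.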
I have a table with 22 million Business records. I can see that there are duplicates when I group by BusinessName and Address and Phone. I'd like to place only the duplicates into a table, with a ranking, oldest business key gets a ranking of 1.
As a bonus I'd like each group to have a distinct group name (although not necessary, just want to know how to do this)
Later after I run more verifications to make sure these are not referenced elsewhere I'll delete everything with a matchRank > 1 out of the main Business table.
DROP TABLE [dbo].[TestBusiness];
GO
CREATE TABLE [dbo].[TestBusiness](
    [Business_pk] INT IDENTITY(1,1) NOT NULL,
    [BusinessName] VARCHAR (200) NOT NULL,
    [Address] VARCHAR(MAX) NOT NULL,
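One possible shape for the ranking described above uses window functions, which are available in SQL Server 2008. This is a sketch under assumptions: the table variable @Business, the Phone column type, the shortened Address column, and the target name #BusinessDuplicates are all hypothetical, and the lowest Business_pk is treated as the oldest key.

```sql
-- Hypothetical cut-down version of the Business table; Phone is assumed.
DECLARE @Business TABLE (
    Business_pk  INT          NOT NULL PRIMARY KEY,
    BusinessName VARCHAR(200) NOT NULL,
    [Address]    VARCHAR(200) NOT NULL,
    Phone        VARCHAR(20)  NOT NULL
);

INSERT INTO @Business (Business_pk, BusinessName, [Address], Phone)
VALUES (1, 'Acme',   '1 Main St', '555-0100'),
       (2, 'Acme',   '1 Main St', '555-0100'),
       (3, 'Zenith', '9 Oak Ave', '555-0199'),
       (4, 'Acme',   '1 Main St', '555-0100');

WITH Ranked AS (
    SELECT Business_pk, BusinessName, [Address], Phone,
           -- 1 = lowest (assumed oldest) key within each duplicate group
           ROW_NUMBER() OVER (PARTITION BY BusinessName, [Address], Phone
                              ORDER BY Business_pk) AS MatchRank,
           -- number of rows sharing the same name, address and phone
           COUNT(*) OVER (PARTITION BY BusinessName, [Address], Phone) AS GroupSize,
           -- same value for every row of a group, different across groups
           DENSE_RANK() OVER (ORDER BY BusinessName, [Address], Phone) AS GroupId
    FROM @Business
)
SELECT GroupId, MatchRank, Business_pk, BusinessName, [Address], Phone
INTO #BusinessDuplicates
FROM Ranked
WHERE GroupSize > 1;   -- keep only rows that actually have a duplicate

SELECT * FROM #BusinessDuplicates ORDER BY GroupId, MatchRank;
```

GroupId is computed before the non-duplicates are filtered out, so the values are distinct per group but not necessarily consecutive. A later cleanup could then target rows with MatchRank > 1.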
Split function. I have records for multiple users, and the last value of every record is a contact number (10 digits, numeric). I want a split function that can take the whole text and split it into records on the basis of the contact number.
In other words, I want SQL to locate the contact number, move on to the next record after it, and so on until the end of the text.
create table tbl_1 (txt varchar (max))
insert into tbl_1 values ('john asfasdf 535 summit ave franklin lks nj 15521 510_644_1079 na na 5,8/12 executive, finance finance and planning far 5537 21133 8.25 126 ronald d hensor jr. 5575621596
[Code] .....
Output john jimenez 535 summit ave franklin lks nj 15521 510_644_1079 na na 5,8/12 executive,finance finance and planning far 5537 21133 8.25 126 ronald d hensor jr. 5575621596 jeffrey galione 57 allen dr wayne nj 15810 562_434_0710 na na 5,8/12 executive, technical sales and support good 8137 91630 8.25 126 eileen oneal 8258364083
Hi all, I have the table dbo.OperatingHour. It has many duplicates and I want to remove them permanently. The statement below runs, but when I open the table the duplicates are still there.
Insert into OperatingHour(Weekdays, Wednesdays, Fridays,Saturdays, [Sundays/Public Holidays]) (SELECT DISTINCT Weekdays, Wednesdays, Fridays,Saturdays, [Sundays/Public Holidays] FROM OperatingHour)
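Note that INSERT ... SELECT DISTINCT only adds rows to the table; it never removes any. A way to actually delete the extra copies on SQL Server 2005 and later is to number the rows inside a CTE and delete through it. The sketch below assumes the column types and uses a hypothetical temp table #OperatingHour in place of the real one.

```sql
-- Hypothetical copy of dbo.OperatingHour; the column types are assumed.
CREATE TABLE #OperatingHour (
    Weekdays   VARCHAR(50) NULL,
    Wednesdays VARCHAR(50) NULL,
    Fridays    VARCHAR(50) NULL,
    Saturdays  VARCHAR(50) NULL,
    [Sundays/Public Holidays] VARCHAR(50) NULL
);

INSERT INTO #OperatingHour
    (Weekdays, Wednesdays, Fridays, Saturdays, [Sundays/Public Holidays])
VALUES
    ('08:00-17:00', '08:00-13:00', '08:00-16:00', '09:00-12:00', 'Closed'),
    ('08:00-17:00', '08:00-13:00', '08:00-16:00', '09:00-12:00', 'Closed'),
    ('07:00-19:00', '07:00-19:00', '07:00-19:00', '08:00-14:00', '09:00-12:00');

-- Number the rows within each group of identical values, then delete
-- every row after the first. The CTE is updatable because it reads
-- from a single base table.
WITH Numbered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Weekdays, Wednesdays, Fridays, Saturdays,
                            [Sundays/Public Holidays]
               ORDER BY (SELECT NULL)) AS rn
    FROM #OperatingHour
)
DELETE FROM Numbered
WHERE rn > 1;

SELECT * FROM #OperatingHour;   -- one row per distinct combination is left
```

Because the rows in a group are identical, it does not matter which copy survives, hence the arbitrary ORDER BY (SELECT NULL).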
Welcome, how can I alter the following table in order to reduce neighbouring duplicates (symbol, position, quantity, price)?
Nr   Symbol    Position  Quantity  Price               Date
1.   wz9999b   1         1.0       2500.0              2007-05-09 08:09:42.653
2.   wz9999b   2         12.0      2500.0              2007-05-09 08:09:42.653
3.   wz9999b   1         100.0     2590.0              2007-05-10 15:47:04.140
4.   PZ0008VX  1         2280.884  2090.5500000000002  2007-05-16 12:43:12.403
5.   PZ0008VX  1         2280.884  2102.0500000000002  2007-05-16 12:45:27.420
6.   wz9999b   1         0.001     2500.0              2007-05-18 09:47:16.033
7.   wz9999b   1         0.001     2500.0              2007-05-18 09:47:53.270
8.   wz9999b   1         1.0       1.0                 2007-05-22 12:35:07.893
9.   PZ0008VX  1         2280.884  2102.0500000000002  2007-05-24 09:38:26.160
10.  PZ0008VX  1         2280.884  2102.0500000000002  2007-05-24 09:38:38.800
11.  wz9999b   1         0.001     2500.0              2007-05-24 12:35:07.207
12.  wz9999b   1         0.002     2500.0              2007-05-24 12:35:14.987
13.  wz9999b   1         0.001     2500.0              2007-05-24 12:38:07.207
In the result set I would like to get rows number 6 and 10. Any suggestions?
I have a situation where we get XML files sent daily that need uploading into SQL Server tables, but the source system producing these files sometimes generates duplicate records in the file. The tricky part is, that the record isn't entirely duplicated. What I mean, is that if I look for duplicates by grouping the key columns, having count(*) > 1, I find which ones are duplicates, but when I inspect the data on these duplicates, the other details in the remaining columns may differ. So our rule is: pick the first record, toss the rest of the duplicates.
Because we don't sort on any columns during the import, the first record kept of the duplicates is arbitrary. Again, we can't tell at this point which of the duplicated records is more correct. Someday down the road, we will do this research.
Now, I need to know the most efficient way to accomplish this in SSIS. If it makes it easier, I could just discard all the duplicates, since the number of them is so small.
If the source were a relational table, I could use a SQL statement to filter the records to remove the duplicates, but since the source is an XML file, I don't know how to filter these out in the pipeline, since the file has to be aggregated to search for dups.
DELETE FROM tblContacts
WHERE tblContacts.ID IN (
        SELECT F.ID
        FROM tblContacts AS F
        WHERE Exists (
            SELECT email, Count(ID)
            FROM tblContacts
            WHERE tblContacts.email = F.email
            GROUP BY tblContacts.email
            HAVING Count(tblContacts.ID) > 1
        )
    )
  AND tblContacts.ID NOT IN (
        SELECT Min(ID)
        FROM tblContacts AS F
        WHERE Exists (
            SELECT email, Count(ID)
            FROM tblContacts
            WHERE tblContacts.email = F.email
            GROUP BY tblContacts.email
            HAVING Count(tblContacts.ID) > 1
        )
        GROUP BY email
    )
I readily admit that I've shamelessly copied 'n pasted this from a tutorial and then taken a stab at tweaking it for my own ends. But I really don't understand what it's doing.
Really, all I want to know is that it will remove records with duplicate email fields. But I could also do with confirming - looking at the "SELECT Min(ID)" bit - does that mean that if it finds a duplicate, it'll delete the latest-added one? And if so, that changing it to remove the earliest-added one is simply a case of changing MIN to MAX?
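Reading the statement: the first IN picks every row whose email occurs more than once, and the NOT IN then spares the row with the lowest ID in each such group. So the lowest ID per email is kept and the rest are deleted. If ID is an auto-incrementing key (an assumption), that means the earliest-added row survives, and swapping MIN for MAX would keep the latest-added one instead. A shorter statement with the same effect can be sketched as follows, using a hypothetical temp table #tblContacts:

```sql
-- Hypothetical sample table using the column names from the statement above.
CREATE TABLE #tblContacts (
    ID    INT          NOT NULL PRIMARY KEY,
    email VARCHAR(100) NOT NULL
);

INSERT INTO #tblContacts (ID, email)
VALUES (1, 'a@example.com'),
       (2, 'b@example.com'),
       (3, 'a@example.com'),
       (4, 'a@example.com');

-- Keep the lowest ID per email and delete everything else.
-- Using MAX(ID) here would keep the highest ID per email instead.
DELETE FROM #tblContacts
WHERE ID NOT IN (SELECT MIN(ID) FROM #tblContacts GROUP BY email);

SELECT ID, email FROM #tblContacts ORDER BY ID;   -- IDs 1 and 2 remain
```

Emails that occur only once are their own MIN(ID), so they are never deleted and the HAVING COUNT > 1 check from the original is not needed.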
I am working with a bunch of records that have duplicates on Persid and intPercentID. I want to remove those duplicates when I load the rows into the temp table. I tried a join on the temp table with NOT EXISTS, but the duplicates still get inserted, and now I am trying a MERGE with the same result. How can I keep duplicates from being inserted into the temp table? I also wrote a cursor, which works but is very slow, so I am looking for better ways.
Create table #TempStr (STRId int not null Identity(1,1) primary key, Persid int, percentId int, dtCreated datetime, CreatedBy int)
INSERT #TempStr (Persid, percentId, dtCreated, CreatedBy)
select intPersonnelID, intPercentID, dtSubmitted, intSubmittedBy
from tblSTR
where intPercentID in (61,62)
group by intPercentID, intPersonnelID, dtSubmitted, intSubmittedBy
UNION ALL
I have a table with the columns ID, DupeID1 and DupeID2. The ID column is unique. Each combination of DupeID1 and DupeID2 should be in the table only once, and I don't want the reverse combination (DupeID2, DupeID1) in the table either. How can I delete the reverse duplicates from this table?
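One set-based way to drop the mirrored rows is a self-referencing EXISTS that deletes a row when its reversed pair is already present under a lower ID. This is a sketch with a hypothetical temp table #Dupes and made-up rows; only the column names come from the question.

```sql
-- Hypothetical sample table; only the column names come from the question.
CREATE TABLE #Dupes (
    ID      INT NOT NULL PRIMARY KEY,
    DupeID1 INT NOT NULL,
    DupeID2 INT NOT NULL
);

INSERT INTO #Dupes (ID, DupeID1, DupeID2)
VALUES (1, 10, 20),
       (2, 20, 10),   -- reverse of row 1
       (3, 30, 40),
       (4, 50, 60),
       (5, 60, 50);   -- reverse of row 4

-- Delete a row when its reversed pair exists under a lower ID,
-- so exactly one row of each mirrored pair survives.
DELETE d
FROM #Dupes AS d
WHERE EXISTS (SELECT 1
              FROM #Dupes AS o
              WHERE o.DupeID1 = d.DupeID2
                AND o.DupeID2 = d.DupeID1
                AND o.ID < d.ID);

SELECT ID, DupeID1, DupeID2 FROM #Dupes ORDER BY ID;   -- IDs 1, 3 and 4 remain
```

The o.ID < d.ID condition is what prevents both halves of a pair from deleting each other; flipping it to > would keep the higher ID instead.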
Product No  Grade   Quantity
A           Good
A           Normal
A           Bad
B           Good
B           Bad
C           Good
C           Normal
C           Bad
In Table 2, Product No divided by Grade. I want to lookup the Quantity from Table 1 to Table 2. The same Product No will have 1 value, the other value is 0. The result for Column Quantity should be like this:
Table 2:
Product No  Grade   Quantity
A           Good    1
A           Normal  0
A           Bad     0
B           Good    2
B           Bad     0
C           Good    3
C           Normal  0
C           Bad     0
Within the LinkingID, there are duplicates in ID1 and ID2 but just in opposite columns. I have been trying to figure out a way to remove these set based. It doesn't matter which duplicate is removed. Essentially these are just endpoints and I don't care which side they are on. The solution must recognize the duplicates and not just remove based on every 2nd row.
How can I perform this task with SSIS or Transact-SQL? I have rows with the data below and want to keep just the valid spelling, but there are many variant spellings, as in the following names. They can be animals, things or personal names.
GABRIEL OBANDO --CORRECT
GABRIEL OVANDO
Gavriel OVANDO
gAbriel OBANDO
GABRIE OBANDO
Gabri OBONDA
MANAGUA --CORRECT
NANAGUA
NAMAGUA
I'm working through the MS example of "removeDuplicates". I can't seem to figure out how to add a custom property for an input column.
I added the helper method:
private static void AddIsKeyCustomPropertyToInput(IDTSInput90 input, object value)
{
    IDTSCustomProperty90 isKey = input.CustomPropertyCollection.New();
    isKey.Name = "IsKey";
    isKey.Value = value;
}
I call it from:
public override void ProvideComponentProperties()
{
    //...
    AddIsKeyCustomPropertyToInput(input, false);
    //...
}
public override void ReinitializeMetaData()
{
    IDTSInput90 input = ComponentMetaData.InputCollection[0];
    if (input.CustomPropertyCollection.Count == 0)
    {
        AddIsKeyCustomPropertyToInput(input, false);
    }
    // ...
}
However, when I deployed it and added the component to an SSIS package, I can't see the custom property "IsKey" in the input column properties window. What am I missing? Please help.
We all were new at one point.... any help is appreciated.
Objective:
Combine two 49,000-row tables and remove records where only one column differs (keeping the record with the specified column value and removing the one with a blank).
Reason:
I have 2 people going through a list, coding a specific column with a single letter value. They both have different progress on each sheet. Hence I am trying to UNION them and have a result of their combined efforts without duplicates.
My progress/where I'm stuck:
Here is my first query/union:
SELECT * FROM [Eds table] UNION SELECT * FROM [Vickis table];
As shown above, I have unioned these 2 tables, and the result removed the obvious whole-record duplicates, but since 1 column is different on the remaining ones, a union without criteria considers them unique.
an example of duplicates that I must remove are as follows:
I have an Excel file as input and import it into a DB table using a Data Flow in SSIS, but the file has duplicates and I don't want to use the duplicate records.
So I planned as below:
Method 1: one OLE DB Destination receives the good records (without duplicates) and a second OLE DB Destination receives the not-good records (only the duplicates). Or Method 2: I add a column (GOOD_RECORD) to the DB table, set it to '1' for the top 1 record (the good record) and '0' for the remaining records (the duplicates), and later filter on the GOOD_RECORD flag,
i.e., select * from DB_TABLE where GOOD_RECORD='1'.
I think Method 2 is advisable for performance and flexibility, but how can I do this update using SSIS (Data Flow)?
I have some duplicate values for my query results, about 200 duplicates out of 30000 rows. Of these 200 duplicates I want to keep the ones that have a higher value for... 'UpdatedBatchID'.
SELECT IR.Id as 'ID'
     , CAST(IR.Priority as varchar) as 'Priority'
     , IRSupportGroupDN.DisplayName as 'Support Group'
     , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.CreatedDate) as 'Created Date'
     , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.ResolvedDate) as 'Resolved Date'
     , SLOConfig.DisplayName as 'SLO'
     , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),SLOFact.TargetEndDate) as 'SLO Target'
     , SLOStatusDN.DisplayName as 'SLO Status'
     , SLOMetric.DisplayName as 'SLO Metric'
     , SLOFact.UpdatedBatchId as 'UpdatedBatchID'
What happens when you add the Ignore Case flag into the mix?
I'm having a hell of a time - I'm dealing with an SCD situation using TableDifference component and I have both existing dimensions and new data coming in, each go through identical Case-Insensitive/Sort with remove duplicates, but I'm getting identical new and deleted records detected - I think because of ordering issues. I'm still trying to whittle the test case down, but I think data from all around the records I'm investigating seems to get sorted in between them, so I'm having trouble getting a small test case built.
I think the mixed case data is the root of the problem, and I think the design is bad, but before I go back to the technical lead, I need to understand enough to show that you cannot take two pipelines sorted and de-duped case-insensitively and then do a case-sensitive table difference operation.
I have the piece of SQL code below that keeps returning duplicates. How can I resolve this?
isnull((select distinct (SUM(a1.ActualDebit) - SUM(a1.ActualCredit))
        from #MainAccount a1
        LEFT OUTER JOIN #BudgetAccount bb
            ON aa.AccountID = bb.AccountID
           AND a1.PeriodStartdate = bb.PeriodStartDate
           and a1.DateMonth = bb.DateMonth
           and a1.Budget = bb.Budget
        WHERE a1.AccountID = aa.AccountID
          and a1.Refdate >= @FROMDATE
          and a1.Refdate <= @TODATE
        GROUP BY a1.group1, a1.Group2), 0) As Actual_CurrentMonth,
WITH cte_OrderProjectType AS (
    select Orderid, min(TypeID), min(CTType), MIN(Area)
    from tableA A
    inner join tableB B ON A.PID = B.PID
    left join tableC C ON C.TypeID = B.TypeID
    LEFT JOIN tableD D ON D.AreaID = B.ID
    group by A.orderid
)
This query uses MIN to eliminate duplicates. It takes 1.30 seconds to complete.
Is there any way I can improve the query performance ?
I have tried this myself, and have attached both the computed results and the expected results.
IF OBJECT_ID('tempdb..#tmpExam1') IS NOT NULL DROP TABLE #tmpExam1
IF OBJECT_ID('tempdb..#tmpExam2') IS NOT NULL DROP TABLE #tmpExam2
IF OBJECT_ID('tempdb..#tmpExam3') IS NOT NULL DROP TABLE #tmpExam3
Auto_ID  Account_ID  Account_Name   Account_Contact  Priority
1        3453463     Tire Co        Doug             1
2        4363763     Computers Inc  Sam              1
3        7857433     Safety First   Heather          1
4        2326743     Car Dept       Clark            1
5        2342567     Sales Force    Amy              1
6        4363763     Computers Inc  Jamie            2
7        2326743     Car Dept       Jenn             2
I'm trying to delete all duplicate Account_IDs, keeping only the row with the highest priority (in this case, the lowest number).
I know the following would delete duplicate Account_IDs:
DELETE FROM staging_account WHERE auto_id NOT IN (SELECT MAX(auto_id) FROM staging_account GROUP BY account_id)
The problem is this doesn't take into account the priority; in the above example I would want to keep auto_ids 2 and 4 because they have a higher priority (1) than auto_ids 6 and 7 (priority 2).
How can I take priority into account and still remove duplicates in this scenario?
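One way to bring priority into it is to number the rows per account_id, ordered by priority, and delete everything after the first. The sketch below loads the sample rows above into a hypothetical temp table #staging_account; the column types are assumed, and auto_id is used as a tie-breaker when two rows share a priority.

```sql
-- Sample rows from the table above in a hypothetical temp table.
CREATE TABLE #staging_account (
    auto_id         INT         NOT NULL PRIMARY KEY,
    account_id      INT         NOT NULL,
    account_name    VARCHAR(50) NOT NULL,
    account_contact VARCHAR(50) NOT NULL,
    [priority]      INT         NOT NULL
);

INSERT INTO #staging_account
    (auto_id, account_id, account_name, account_contact, [priority])
VALUES (1, 3453463, 'Tire Co',       'Doug',    1),
       (2, 4363763, 'Computers Inc', 'Sam',     1),
       (3, 7857433, 'Safety First',  'Heather', 1),
       (4, 2326743, 'Car Dept',      'Clark',   1),
       (5, 2342567, 'Sales Force',   'Amy',     1),
       (6, 4363763, 'Computers Inc', 'Jamie',   2),
       (7, 2326743, 'Car Dept',      'Jenn',    2);

-- rn = 1 is the row to keep: lowest priority number first,
-- then lowest auto_id if priorities tie.
WITH Ranked AS (
    SELECT auto_id,
           ROW_NUMBER() OVER (PARTITION BY account_id
                              ORDER BY [priority], auto_id) AS rn
    FROM #staging_account
)
DELETE FROM Ranked
WHERE rn > 1;

SELECT auto_id FROM #staging_account ORDER BY auto_id;   -- 1 through 5 remain
```

With the sample data this removes auto_ids 6 and 7 and keeps 2 and 4, which matches the outcome described above.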
I have a patient record and emergency contact information. I need to find duplicate phone numbers in the emergency contact table based on the relationship type (RelationType) between the emergency contact and the patient. For example, if the patient is a child and has the mother listed twice with the same number, I need to filter out these records. The same would be true if a father were listed; in any case there should be only one father or one mother listed for a patient. The link between patient and emergency contact is person_gu. If two siblings are linked to the same person_gu, there should still be only one emergency contact listed.
Below is the schema structure:
Person_Info: PersonID; contains everyone (patient, visitor, emergency contact) with first and last names
Patient_Info: PatientID; contains the patient ID and other information
Patient_PersonRelation: Person_ID, patientID, RelationType
Address: contains the address of every person and patient (key is PersonID)
Phone: contains the phone # of everyone (key is PersonID)
The goal is to find matching phones for the same person based on relationship type (if siblings, then only list one record for the parent, because the matching phones are not duplicates).
I found some duplicate data as I was going through the logic of a data pump. The entire row is not duplicated, however. I would like to delete only the one row.
This is a sample of the data:
DECLARE @SomeData TABLE (
    FirstName varchar(25)
  , MiddleName varchar(25)
  , LastName varchar(25)
  , StreetAddress varchar(25)
  , Suite varchar(25)
  , City varchar(25)
  , [State] varchar(25)
  , PostalCode varchar(10)
[code]...
As you can see, Joe Smith has two rows, but only one of the rows is complete. I would like to delete only the row that has a NULL value in the phone and area code for Joe Smith. There are a few thousand rows like this: they are duplicates in everything but the area code and phone number. I am used to using a CTE to remove duplicates, but I am a little lost on this one. The things that I have tried have not worked exactly as I planned.
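One possible approach is to delete an incomplete row only when a complete twin exists, using EXISTS. This is a sketch under assumptions: the sample data is elided above, so the temp table #SomeData, the column names AreaCode and Phone, and the rows are hypothetical, and only a few of the matching columns are shown.

```sql
-- Hypothetical, cut-down sample; AreaCode and Phone are assumed column names.
CREATE TABLE #SomeData (
    FirstName     VARCHAR(25) NOT NULL,
    LastName      VARCHAR(25) NOT NULL,
    StreetAddress VARCHAR(25) NOT NULL,
    PostalCode    VARCHAR(10) NOT NULL,
    AreaCode      VARCHAR(3)  NULL,
    Phone         VARCHAR(8)  NULL
);

INSERT INTO #SomeData (FirstName, LastName, StreetAddress, PostalCode, AreaCode, Phone)
VALUES ('Joe',  'Smith', '1 Main St', '12345', '555', '555-0100'),
       ('Joe',  'Smith', '1 Main St', '12345', NULL,  NULL),
       ('Jane', 'Doe',   '2 Oak Ave', '54321', NULL,  NULL);   -- no complete twin

-- Remove a row with no phone details only if another row matches it on
-- the identifying columns and does carry a phone number.
DELETE inc
FROM #SomeData AS inc
WHERE inc.AreaCode IS NULL
  AND inc.Phone IS NULL
  AND EXISTS (SELECT 1
              FROM #SomeData AS cmp
              WHERE cmp.FirstName     = inc.FirstName
                AND cmp.LastName      = inc.LastName
                AND cmp.StreetAddress = inc.StreetAddress
                AND cmp.PostalCode    = inc.PostalCode
                AND cmp.Phone IS NOT NULL);

SELECT * FROM #SomeData;   -- the complete Joe Smith row and Jane Doe remain
```

Nullable matching columns such as MiddleName or Suite would need NULL-safe comparisons, because = never matches two NULLs.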
SELECT DISTINCT
       S.EnrollNo
     , S.Name
     , ET.Descriptions AS EventName
     , SA.Name AS AttendStudent
     , '' AS AttendFaculty
FROM StudentEvent SE
INNER JOIN SStudent S ON SE.PresentatorID = S.StudentID
-- note: the third argument of substring is a length, not an end position
select top 5000
       textdata,
       substring(textdata, charindex('exec', textdata) + 5, charindex('@', textdata) - 1)
from trace_table
where TextData like '%sp_%'
  and TextData like '%declare%'