Sort Component, Remove Duplicates, Comparison Flags - Ignore Case
Feb 29, 2008
What happens when you add the Ignore Case flag into the mix?
I'm having a hell of a time. I'm dealing with an SCD situation using the TableDifference component. I have both existing dimension rows and new data coming in; each goes through an identical case-insensitive Sort with remove duplicates, but I'm getting identical new and deleted records detected, I think because of ordering issues. I'm still trying to whittle the test case down, but data from all around the records I'm investigating seems to get sorted in between them, so I'm having trouble building a small test case.
I think the mixed case data is the root of the problem, and I think the design is bad, but before I go back to the technical lead, I need to understand enough to show that you cannot take two pipelines sorted and de-duped case-insensitively and then do a case-sensitive table difference operation.
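One way to see the failure mode, as a minimal sketch outside SSIS: the Python below only mimics a "keep the first row per case-folded key" de-duplication followed by an exact, case-sensitive comparison. The function names and sample values are made up for illustration, and the real Sort component makes no promise about which duplicate survives.

```python
def dedupe_ignore_case(rows):
    """Keep the first value seen for each case-folded key (arrival order decides)."""
    kept = {}
    for value in rows:
        kept.setdefault(value.casefold(), value)
    return sorted(kept.values(), key=str.casefold)


def table_difference(existing, incoming):
    """Case-sensitive comparison: returns (new, deleted) as sorted lists."""
    old, new = set(existing), set(incoming)
    return sorted(new - old), sorted(old - new)


# Same logical keys on both sides, but the duplicates arrive in a different order.
existing_dim = dedupe_ignore_case(["Smith", "SMITH", "jones"])
incoming = dedupe_ignore_case(["SMITH", "Smith", "jones"])

new_rows, deleted_rows = table_difference(existing_dim, incoming)
print("new:", new_rows, "deleted:", deleted_rows)
```

Both pipelines hold the same key case-insensitively, yet the case-sensitive difference reports it once as new and once as deleted, because each side kept a different spelling.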
I am comparing two fields, one from our legacy table and one in our new table structure, that should have identical text data. The new field has an assortment of ANSI characters that the legacy data did not have. Is there anything I can do that will ignore all ANSI character differences? The only route I can think of is to do a replace on each ANSI character type in the new column, but there are quite a few character types.
I have a simple SSIS package with the following data flow structure:
Flat File Source > Data Conversion > Aggregate > Derived Column > OLE DB Destination
Basically, this package is executed daily to import sales from a text file into SQL Server. Occasionally the text file may contain previous sales.
My SQL table enforces data integrity through a primary key, so my package errors out if the text file contains a duplicate of a row that already exists in SQL Server.
I'm looking for a way to ignore these duplicates and continue importing the rest of the file. I need a way to force execution (suppressing errors if possible) and finish importing the whole text file.
I've tried setting MaximumErrorCount to more than 50000 and FailParentOnFailure/FailPackageOnFailure to False.
Any help is greatly appreciated...thanks
Here's the errors I receive:
SSIS package "Package2.dtsx" starting.
Information: 0x4004300A at Data Flow Task, DTS.Pipeline: Validation phase is beginning.
Information: 0x40043006 at Data Flow Task, DTS.Pipeline: Prepare for Execute phase is beginning.
Information: 0x40043007 at Data Flow Task, DTS.Pipeline: Pre-Execute phase is beginning.
Information: 0x402090DC at Data Flow Task, Source - SALES_TXT [1]: The processing of file "Z:SALES.TXT" has started.
Information: 0x4004300C at Data Flow Task, DTS.Pipeline: Execute phase is beginning.
Information: 0x402090DE at Data Flow Task, Source - SALES_TXT [1]: The total number of data rows processed for file "Z:SALES.TXT" is 20450.
Information: 0x402090DF at Data Flow Task, Destination - SALES [37]: The final commit for the data insertion has started.
Error: 0xC0202009 at Data Flow Task, Destination - SALES [37]: An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "The statement has been terminated.". An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Violation of PRIMARY KEY constraint 'PK_Sale'. Cannot insert duplicate key in object 'dbo.Sale'.".
Information: 0x402090E0 at Data Flow Task, Destination - SALES [37]: The final commit for the data insertion has ended.
Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: The ProcessInput method on component "Destination - SALES" (37) failed with error code 0xC0202009. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.
Error: 0xC0047021 at Data Flow Task, DTS.Pipeline: Thread "WorkThread1" has exited with error code 0xC0202009.
Information: 0x40043008 at Data Flow Task, DTS.Pipeline: Post Execute phase is beginning.
Information: 0x402090DD at Data Flow Task, Source - SALES_TXT [1]: The processing of file "Z:SALES.TXT" has ended.
Information: 0x40043009 at Data Flow Task, DTS.Pipeline: Cleanup phase is beginning.
Information: 0x4004300B at Data Flow Task, DTS.Pipeline: "component "Destination - SALES" (37)" wrote 18750 rows.
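One common way around a primary key violation like this is to land the file in a staging table and insert only the keys the destination does not hold yet, instead of suppressing the error. The sketch below is an assumption-laden illustration: it uses Python's sqlite3 as a stand-in for SQL Server, and the staging table SaleStaging and the columns SaleId and Amount are hypothetical names, not taken from the package.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sale (SaleId INTEGER PRIMARY KEY, Amount REAL);
    CREATE TABLE SaleStaging (SaleId INTEGER, Amount REAL);
    INSERT INTO Sale VALUES (1, 10.0), (2, 20.0);         -- already imported
    INSERT INTO SaleStaging VALUES (2, 20.0), (3, 30.0);  -- today's file
""")

# Insert only staged rows whose key is not in the destination yet (anti-join).
conn.execute("""
    INSERT INTO Sale (SaleId, Amount)
    SELECT s.SaleId, s.Amount
    FROM SaleStaging AS s
    WHERE NOT EXISTS (SELECT 1 FROM Sale AS d WHERE d.SaleId = s.SaleId)
""")

sale_ids = [row[0] for row in conn.execute("SELECT SaleId FROM Sale ORDER BY SaleId")]
print(sale_ids)
```

The previously imported sale (key 2) is skipped without raising an error, and the new sale (key 3) is loaded. The same INSERT ... WHERE NOT EXISTS shape should carry over to T-SQL, though that is not tested here.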
I am using Microsoft SQL Server Integration Services Designer Version 9.00.1399.00.
My OLE DB Destination component is inserting data into a table. When there is a duplicate, it fails. I want to ignore this because I know the data is the same. But even when I set Ignore Failure in the OLE DB Destination Editor, it does not work, and every time I reopen the editor the bottom drop-down box shows "Fail component". When I run it, it always fails on inserting the duplicated rows.
Does anyone know how I can tell Integration Services to ignore this error?
Hi, I believe my SQL Server was configured as case sensitive. I have a number of stored procedures which were moved from a non-case-sensitive SQL Server. Because of the case sensitivity, I have to do a lot of editing in those stored procedures. Is there a quick way to avoid the editing? Something like ignoring the case in one statement? Thanks in advance, your advice will be greatly appreciated.
Hi All, I have the table dbo.OperatingHour. It has many duplicates and I want to remove them permanently. The statement below works, but when I open the table there are no changes.
INSERT INTO OperatingHour (Weekdays, Wednesdays, Fridays, Saturdays, [Sundays/Public Holidays])
    (SELECT DISTINCT Weekdays, Wednesdays, Fridays, Saturdays, [Sundays/Public Holidays]
     FROM OperatingHour)
Welcome, how can I alter the following table in order to reduce neighbouring duplicates (symbol, position, quantity, price)?
Nr   Symbol    Position  Quantity  Price               Date
1.   wz9999b   1         1.0       2500.0              2007-05-09 08:09:42.653
2.   wz9999b   2         12.0      2500.0              2007-05-09 08:09:42.653
3.   wz9999b   1         100.0     2590.0              2007-05-10 15:47:04.140
4.   PZ0008VX  1         2280.884  2090.5500000000002  2007-05-16 12:43:12.403
5.   PZ0008VX  1         2280.884  2102.0500000000002  2007-05-16 12:45:27.420
6.   wz9999b   1         0.001     2500.0              2007-05-18 09:47:16.033
7.   wz9999b   1         0.001     2500.0              2007-05-18 09:47:53.270
8.   wz9999b   1         1.0       1.0                 2007-05-22 12:35:07.893
9.   PZ0008VX  1         2280.884  2102.0500000000002  2007-05-24 09:38:26.160
10.  PZ0008VX  1         2280.884  2102.0500000000002  2007-05-24 09:38:38.800
11.  wz9999b   1         0.001     2500.0              2007-05-24 12:35:07.207
12.  wz9999b   1         0.002     2500.0              2007-05-24 12:35:14.987
13.  wz9999b   1         0.001     2500.0              2007-05-24 12:38:07.207
In the result set I would like to get the rows number 6 and 10. Any suggestions?
I have a situation where we get XML files sent daily that need uploading into SQL Server tables, but the source system producing these files sometimes generates duplicate records in the file. The tricky part is, that the record isn't entirely duplicated. What I mean, is that if I look for duplicates by grouping the key columns, having count(*) > 1, I find which ones are duplicates, but when I inspect the data on these duplicates, the other details in the remaining columns may differ. So our rule is: pick the first record, toss the rest of the duplicates.
Because we don't sort on any columns during the import, the first record kept of the duplicates is arbitrary. Again, we can't tell at this point which of the duplicated records is more correct. Someday down the road, we will do this research.
Now, I need to know the most efficient way to accomplish this in SSIS. If it makes it easier, I could just discard all the duplicates, since the number of them is so small.
If the source were a relational table, I could use a SQL statement to filter the records to remove the duplicates, but since the source is an XML file, I don't know how to filter these out in the pipeline, since the file has to be aggregated to search for dups.
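The "pick the first record, toss the rest" rule can be sketched as a single pass over the rows that remembers which keys it has already seen. This is only an illustration of the rule in Python, not SSIS code; the function name, column names, and sample rows are hypothetical. Inside a data flow, the Sort component's option to remove rows with duplicate sort values is the closest built-in counterpart.

```python
def keep_first_per_key(records, key_columns):
    """Yield the first record seen for each key; later duplicates are dropped."""
    seen = set()
    for record in records:
        key = tuple(record[column] for column in key_columns)
        if key in seen:
            continue  # same key as an earlier row, even if other columns differ
        seen.add(key)
        yield record


rows = [
    {"order_id": 1, "line": 1, "note": "a"},
    {"order_id": 1, "line": 1, "note": "b"},  # duplicate key, different details
    {"order_id": 2, "line": 1, "note": "c"},
]
unique_rows = list(keep_first_per_key(rows, ["order_id", "line"]))
print(unique_rows)
```

Because only the keys are held in memory, the full file never has to be aggregated; which duplicate counts as "first" is simply whichever arrives first.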
DELETE FROM tblContacts
WHERE tblContacts.ID IN (
    SELECT F.ID
    FROM tblContacts AS F
    WHERE Exists (
        SELECT email, Count(ID)
        FROM tblContacts
        WHERE tblContacts.email = F.email
        GROUP BY tblContacts.email
        HAVING Count(tblContacts.ID) > 1
    )
)
AND tblContacts.ID NOT IN (
    SELECT Min(ID)
    FROM tblContacts AS F
    WHERE Exists (
        SELECT email, Count(ID)
        FROM tblContacts
        WHERE tblContacts.email = F.email
        GROUP BY tblContacts.email
        HAVING Count(tblContacts.ID) > 1
    )
    GROUP BY email
)
I readily admit that I've shamelessly copied 'n pasted this from a tutorial and then taken a stab at tweaking it for my own ends. But I really don't understand what it's doing.
Really, all I want to know is that it will remove records with duplicate email fields. But I could also do with confirming - looking at the "SELECT Min(ID)" bit - does that mean that if it finds a duplicate, it'll delete the latest-added one? And if so, that changing it to remove the earliest-added one is simply a case of changing MIN to MAX?
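A small experiment can answer the MIN versus MAX question. The sketch below is not the statement from the tutorial: it uses a simplified delete (keep one ID per email, remove the rest) and runs it in SQLite through Python rather than on SQL Server. The sample rows are invented, and "earliest-added" assumes IDs increase with insertion order.

```python
import sqlite3


def surviving_ids(aggregate):
    """Delete duplicate emails, keeping the row picked by MIN or MAX of ID."""
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be MIN or MAX")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblContacts (ID INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO tblContacts (ID, email) VALUES (?, ?)",
        [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")],
    )
    # Keep exactly one ID per email; every other row with that email is removed.
    conn.execute(
        "DELETE FROM tblContacts WHERE ID NOT IN "
        f"(SELECT {aggregate}(ID) FROM tblContacts GROUP BY email)"
    )
    return [r[0] for r in conn.execute("SELECT ID FROM tblContacts ORDER BY ID")]


print(surviving_ids("MIN"))
print(surviving_ids("MAX"))
```

With MIN the lowest ID of each email survives, so the later-added duplicate is the one deleted; with MAX the highest ID survives and the earlier one is deleted.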
I am working with a bunch of records that have duplicates on Persid and intPercentID. I want to remove the duplicates when I put the rows in the temp table. I tried a join on the temp table with NOT EXISTS, but the duplicates are still inserted, so now I am trying a MERGE, with the same result. How can I keep duplicates from being inserted into the temp table? I also wrote a cursor, which works but is very slow, so I am looking for better ways.
Create table #TempStr (
    STRId int not null Identity(1,1) primary key,
    Persid int,
    percentId int,
    dtCreated datetime,
    CreatedBy int)
INSERT #TempStr (Persid, percentId, dtCreated, CreatedBy)
select intPersonnelID, intPercentID, dtSubmitted, intSubmittedBy
from tblSTR
where intpercentId in (61,62)
group by intPercentID, intPersonnelID, dtSubmitted, intSubmittedBy
UNION ALL
I have table with columns as ID, DupeID1, DupeID2. ID column is unique. DupeID1 and DupeID2 -- the combination should only be there once. I don't want reverse combination of duplicates, i.e. DupeID2, DupeID1 in the table. How can I delete the reverse duplicates from this table?
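One set-based way to drop reverse pairs is to delete a row whenever its mirrored pair exists under a lower ID. The sketch below runs that idea in SQLite through Python as a stand-in for SQL Server; the table name Dupes and the sample rows are hypothetical, and only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dupes (ID INTEGER PRIMARY KEY, DupeID1 INTEGER, DupeID2 INTEGER)"
)
conn.executemany(
    "INSERT INTO Dupes VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 20, 10), (3, 30, 40), (4, 50, 60), (5, 60, 50)],
)

# Delete a row when the reversed pair is present with a lower ID,
# so the lower-ID row of each mirrored pair is the one that stays.
conn.execute("""
    DELETE FROM Dupes
    WHERE EXISTS (
        SELECT 1
        FROM Dupes AS r
        WHERE r.DupeID1 = Dupes.DupeID2
          AND r.DupeID2 = Dupes.DupeID1
          AND r.ID < Dupes.ID
    )
""")

remaining = conn.execute(
    "SELECT ID, DupeID1, DupeID2 FROM Dupes ORDER BY ID"
).fetchall()
print(remaining)
```

The `r.ID < Dupes.ID` condition is what stops both halves of a pair from deleting each other; pairs with no mirror (ID 3 here) are left alone.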
Product No  Grade   Quantity
A           Good
A           Normal
A           Bad
B           Good
B           Bad
C           Good
C           Normal
C           Bad
In Table 2, Product No is divided by Grade. I want to look up the Quantity from Table 1 into Table 2. For each Product No, one row gets the value and the other rows get 0. The result for the Quantity column should look like this:
Table 2:
Product No  Grade   Quantity
A           Good    1
A           Normal  0
A           Bad     0
B           Good    2
B           Bad     0
C           Good    3
C           Normal  0
C           Bad     0
I am trying to use a date comparison in a statement using the year statement as well. Here is what I have:
Case [LastHireDate]
    When YEAR([LastHireDate]) < Year(@EndYearlyDate) then '12'
    When Month([LastHireDate]) = '1' then '12'
    When Month([LastHireDate]) = '2' then '11'
    When Month([LastHireDate]) = '3' then '10'
    When Month([LastHireDate]) = '4' then '9'
[Code] ....
When I look at it, [LastHireDate] shows a red line underneath, as do the < symbol and @EndYearlyDate. I cannot seem to get them to clear and am wondering what I am missing. When I execute, the error says that it does not like the < sign in there.
Within the LinkingID, there are duplicates in ID1 and ID2 but just in opposite columns. I have been trying to figure out a way to remove these set based. It doesn't matter which duplicate is removed. Essentially these are just endpoints and I don't care which side they are on. The solution must recognize the duplicates and not just remove based on every 2nd row.
I have a bunch of contacts that I've scored how well their names match to other contacts in the same business. I can programmatically figure out how to parse the results, but would like to know how to do this via SQL. My problem is for Business_fk 968976 I have 7 contacts. In the end I should have 4 contacts based on name match. For the business key listed Gerardo Lopez is in the ContactScore table twice for Contact keys 7355719 and 57028145. I then have two rows like so:
The two rows reference each other. Two is the easy case; a more difficult case would have key 1 listed 10 times with a ContactMatch_fk of 2 - 11, and then Contact_fk 2 listed 10 times with a ContactMatch_fk of 1, 3 - 11. I know 57028145 maps to 7355719 from the first row in the ContactScore table, so when a Contact_fk of 7355719 comes up I should be able to skip it and not process that match. Hopefully that makes sense. Anyway, here is the test data:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ContactScore]') AND type in (N'U'))
    DROP TABLE [dbo].[ContactScore];
GO
CREATE TABLE [dbo].[ContactScore] (
    [ContactScore_pk] INT NOT NULL,
    [Contact_fk] INT NOT NULL,
How can I perform this task with SSIS or Transact-SQL? I have rows with the data below and want to keep just the valid value. There are many variant spellings, as in the following names; the values can be animals, things, or personal names.
GABRIEL OBANDO --CORRECT
GABRIEL OVANDO
Gavriel OVANDO
gAbriel OBANDO
GABRIE OBANDO
Gabri OBONDA
MANAGUA --CORRECT
NANAGUA
NAMAGUA
I'm working through the MS "removeDuplicates" example. I can't seem to figure out how to add a custom property for an input column.
I added the helper method:
private static void AddIsKeyCustomPropertyToInput(IDTSInput90 input, object value)
{
    IDTSCustomProperty90 isKey = input.CustomPropertyCollection.New();
    isKey.Name = "IsKey";
    isKey.Value = value;
}
I call it from:
public override void ProvideComponentProperties()
{
    //...
    AddIsKeyCustomPropertyToInput(input, false);
    //...
}
public override void ReinitializeMetaData()
{
    IDTSInput90 input = ComponentMetaData.InputCollection[0];
    if (input.CustomPropertyCollection.Count == 0)
    {
        AddIsKeyCustomPropertyToInput(input, false);
    }
    // ...
}
However, when I deployed it and added the component to an SSIS package, I can't see the custom property "IsKey" in the input column properties window. What am I missing? Please help.
We all were new at one point.... any help is appreciated.
Objective:
Combine two 49,000-row tables and remove records where only one column differs (keeping the row with the specified column value and removing the one with a blank).
Reason:
I have 2 people going through a list, coding a specific column with a single letter value. They both have different progress on each sheet. Hence I am trying to UNION them and have a result of their combined efforts without duplicates.
My progress/where I'm stuck:
Here is my first query/union:
SELECT * FROM [Eds table] UNION SELECT * FROM [Vickis table];
As shown above, I have unioned these 2 tables and my results removed the obvious whole-record duplicates, but since 1 column is different on these, a union without criteria considers them unique.
an example of duplicates that I must remove are as follows:
I have an Excel file as input and import it into a DB table using a data flow in SSIS, but it contains duplicates, and I don't want to use the duplicate records.
So I planned as below:
Method 1: one OLE DB Destination receives the good records (without duplicates) and another OLE DB Destination receives the not-good records (only duplicates).
Method 2: add a column (GOOD_RECORD) to the DB table, update it to '1' for the top 1 record (the good record) and to '0' for the remaining records (the dups), and later filter on the GOOD_RECORD flag,
i.e., select * from DB_TABLE where GOOD_RECORD='1'.
I think Method 2 is advisable for performance and flexibility, but how can I do the update using SSIS (data flow)?
I have some duplicate values for my query results, about 200 duplicates out of 30000 rows. Of these 200 duplicates I want to keep the ones that have a higher value for... 'UpdatedBatchID'.
SELECT
    IR.Id as 'ID'
    , CAST(IR.Priority as varchar) as 'Priority'
    , IRSupportGroupDN.DisplayName as 'Support Group'
    , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.CreatedDate) as 'Created Date'
    , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.ResolvedDate) as 'Resolved Date'
    , SLOConfig.DisplayName as 'SLO'
    , DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),SLOFact.TargetEndDate) as 'SLO Target'
    , SLOStatusDN.DisplayName as 'SLO Status'
    , SLOMetric.DisplayName as 'SLO Metric'
    , SLOFact.UpdatedBatchId as 'UpdatedBatchID'
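Keeping only the row with the highest UpdatedBatchID per ID can be sketched with a correlated MAX subquery. This is a reduced illustration run in SQLite through Python, not the full query: the table Results stands in for the result set above, and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Results (ID TEXT, SLO TEXT, UpdatedBatchID INTEGER)")
conn.executemany(
    "INSERT INTO Results VALUES (?, ?, ?)",
    [("IR1", "Resolve", 100), ("IR1", "Resolve", 105), ("IR2", "Resolve", 101)],
)

# For each ID keep only the row whose UpdatedBatchID equals that ID's maximum.
latest = conn.execute("""
    SELECT r.ID, r.SLO, r.UpdatedBatchID
    FROM Results AS r
    WHERE r.UpdatedBatchID = (
        SELECT MAX(x.UpdatedBatchID) FROM Results AS x WHERE x.ID = r.ID
    )
    ORDER BY r.ID
""").fetchall()
print(latest)
```

If two rows of the same ID tie on UpdatedBatchID, both survive this filter, so a tie-breaker column would be needed in that case.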
I'm using several Sort components in my data flow task (for everyone now saying "uuhh": none of them are used in the normal flow, only on lookup failures). I know that the Sort component is more than slow and deprecated.
So when I find some rows with a foreign key not yet in my dimTable, I send them down another branch and sort/aggregate them for insertion. After that, I want to look up again, so I have to wait for all inserts to complete. That is why I use the Sort component, which has to wait for all rows, and sort on my business key. Once all rows are inserted and sorted, I look up again.
This scenario worked for me in a lot of packages with a lot of rows. So why do I post :-)
I now have another package. While processing the stream, 1796 rows go to the Sort component to be sorted, but nothing happens. The processor idles at 0-5% for the debug host. The HDD does some reading and writing, but nothing really big. The temp files for sorting are not getting refreshed. Memory usage is 1.6GB (/3GB is set in boot.ini). No error message at all. And it stays in this state for hours and hours.
Any hint why the package is going into that state?
I have successfully built a messaging system into my application, and I am now in the process of displaying the messages in the UI.
The following are how my tables are constructed.
CREATE TABLE [MailBox].[Message](
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [SenderId] [bigint] NOT NULL,
    [Message] [nvarchar](max) NOT NULL,
    [SentDate] [datetime] NOT NULL,
    CONSTRAINT [PK_MailBox.Message] PRIMARY KEY CLUSTERED
[Code] .....
Now I haven't set the foreign key on the MessageReceipient table yet. When someone sends me an email I insert a record into [MailBox].[Message] and output the insert id into MessageReceipient along with the ReceipientId this is working as expected, when I then click on my inbox I call the following stored procedure:
Select p.Username, count(mr.RecipientId) [TotalMessages],
    CASE
        WHEN mr.ReadDate is null then 1 -- New message
        WHEN mr.ReadDate is not null then 0 -- Message has been read
    END AS NewMessage
FROM [User].[User_Profile] p
JOIN [MailBox].[Message] m on p.Id = m.SenderId
JOIN [MailBox].[MessageRecipient] mr on m.Id = mr.MessageId
GROUP BY p.Username, mr.RecipientId, mr.ReadDate
This will give me the person who has emailed me, the total amount of messages and if the message is new or its been read, I do this by checking the ReadDate column as shown in the case statement (but this gives me duplicates, which is not what I want). Lets say user1 emails me 5 times so when I call this proc I will have the same user displayed to me 5 times, what I'm trying to achieve with the proc is it will show User1 as the following:
User1 5 Messages 1 or 0 New Messages
I can get it to display as follow when I remove the case statement
User1 5 Messages
but as soon as I add the case statement back in then I get 5 rows.
How can I change this proc in such a way that it will display the data as follows;
User1 5 Messages 1 or 0 New Messages
New Messages is dependent on ReadDate if its null then we have a new message, otherwise its been read.
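The extra rows come from grouping by ReadDate: every distinct read date forms its own group. One possible fix is to move the CASE inside an aggregate and group only by the sender. The sketch below shows this in SQLite through Python as a stand-in for SQL Server; the schema prefixes are dropped, recipient id 9 and the sample rows are invented, and only the table and column names follow the procedure above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User_Profile (Id INTEGER PRIMARY KEY, Username TEXT);
    CREATE TABLE Message (Id INTEGER PRIMARY KEY, SenderId INTEGER);
    CREATE TABLE MessageRecipient (MessageId INTEGER, RecipientId INTEGER, ReadDate TEXT);
    INSERT INTO User_Profile VALUES (1, 'User1');
    INSERT INTO Message VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 1);
    INSERT INTO MessageRecipient VALUES
        (1, 9, '2014-01-01'), (2, 9, '2014-01-02'), (3, 9, NULL),
        (4, 9, '2014-01-03'), (5, 9, NULL);
""")

# The CASE sits inside MAX, so ReadDate no longer needs to be in GROUP BY.
inbox = conn.execute("""
    SELECT p.Username,
           COUNT(*) AS TotalMessages,
           MAX(CASE WHEN mr.ReadDate IS NULL THEN 1 ELSE 0 END) AS NewMessage
    FROM User_Profile AS p
    JOIN Message AS m ON p.Id = m.SenderId
    JOIN MessageRecipient AS mr ON m.Id = mr.MessageId
    WHERE mr.RecipientId = 9
    GROUP BY p.Username
""").fetchall()
print(inbox)
```

MAX yields a 1-or-0 flag per sender; swapping it for SUM would return the number of unread messages instead.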
How would SQL Server 2012 treat a literal string in a comparison similar to the one below? I want to ensure that the server isn't implicitly converting the value as it runs the SQL, so I'd rather change the data type in one of my tables, as Unicode isn't required.
Declare @T Table (S varchar(2))
Declare @S nvarchar(255)
Insert into @T Values ('AR'), ('AT'), ('AW')
Set @S = 'Auto Repairs'
Select * from @T T
where case @S
        when 'Auto Repairs' then 'AR'
        when 'Auto Target' then 'AT'
        when 'Auto Wash' then 'AW'
      end = T.S
To summarise: in the above, would AR, AT and AW in the case statement be treated as nvarchar, as that's the type of the variable the case is wrapped around, or as varchar, as that's what I'm comparing them to?
Hi,
I'm trying to create a Stored Procedure that returns a recordset, but I want to be able to choose the ORDER BY clause in my parameter list of the Stored Procedure. Since CASE .. WHEN can only be used in the SELECT clause, I came up with the following:
-- BEGIN SCRIPT --
DECLARE @blah AS VARCHAR(20)
SET @blah = 'DOSSIER_CODE'
SELECT DOC_PID, SCAN_DATE, DOC_STATE, isViewed, DOC_COMMENT,
    requestDelete, USER_FNAME, USER_NAME, DOSSIER_CODE, COUNT(NOTE_PID) NrOfNotes,
    CASE @blah
        WHEN 'DOSSIER_CODE' THEN DOSSIER_CODE
        WHEN 'SCAN_DATE' THEN SCAN_DATE
        ELSE SCAN_DATE
    END AS ORDERFIELD
FROM MR_DOCS
LEFT OUTER JOIN MR_USERS ON MR_DOCS.USER_FID = USER_PID
LEFT OUTER JOIN MR_DOSSIERS ON DOSSIER_FID = DOSSIER_PID
LEFT OUTER JOIN MR_NOTES ON DOC_PID = MR_NOTES.DOC_FID
WHERE MR_DOCS.USER_FID = 1
AND DOC_STATE IN (1, 3, 4)
AND REMINDER_DATE <= getdate()
AND MR_DOCS.isVisible = 1
AND TREE_FID IS NULL
-- Added by Tim Derdelinckx - 2005.06.20
AND TODO_FID IS NULL
-- Select documents that are scanned for this user (1),
-- or moved to this user (3),
-- or forwarded to this user (4),
GROUP BY DOC_PID, SCAN_DATE, DOC_STATE, isViewed, DOC_COMMENT,
    requestDelete, USER_FNAME, USER_NAME, DOSSIER_CODE
UNION
SELECT DOC_PID, SCAN_DATE, DOC_STATE, isViewed, DOC_COMMENT,
    requestDelete, USER_FNAME, USER_NAME, DOSSIER_CODE, COUNT(NOTE_PID) NrOfNotes,
    CASE @blah
        WHEN 'DOSSIER_CODE' THEN DOSSIER_CODE
        WHEN 'SCAN_DATE' THEN SCAN_DATE
        ELSE SCAN_DATE
    END AS ORDERFIELD
FROM MR_DOCS
LEFT OUTER JOIN MR_USERS ON USER_FID = USER_PID
LEFT OUTER JOIN MR_DOSSIERS ON DOSSIER_FID = DOSSIER_PID
LEFT OUTER JOIN MR_NOTES ON DOC_PID = MR_NOTES.DOC_FID
WHERE BORROW_USER_FID = 1
AND DOC_STATE = 5
AND REMINDER_DATE <= getdate()
AND MR_DOCS.isVisible = 1
AND TREE_FID IS NULL
-- Added by Tim Derdelinckx - 2005.06.20
AND TODO_FID IS NULL
-- or borrowed to this user
GROUP BY DOC_PID, SCAN_DATE, DOC_STATE, isViewed, DOC_COMMENT,
    requestDelete, USER_FNAME, USER_NAME, DOSSIER_CODE
ORDER BY ORDERFIELD DESC
-- END SCRIPT --
But it doesn't seem to work correctly:
When SET @blah = 'SCAN_DATE', it works just fine!
When SET @blah = 'DOSSIER_CODE', I get an error:
Server: Msg 242, Level 16, State 3, Line 3
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.
Warning: Null value is eliminated by an aggregate or other SET operation.
Anyone any ideas about this? Or maybe another way of handling this (not with CASE .. WHEN)?
Thanks a lot,
Tim@Allgeier
What I want to do is remove the time component from a datetime value. That is to say, I want 2008-04-18 00:00:00.000 instead of 2008-04-18 13:46:57.983. I have heard that the only way to do this is to convert the datetime to a string, strip off the time portion, and then convert back to a date. Isn't there a simpler way?
I recently stumbled upon this construct: { d '2008-04-18' }. It appears that this might accomplish what I am looking for, except that it appears to accept only literal strings and not variables. I think this construct comes from the ODBC "improvements" realm, so I am leery of depending on it as I am not sure it is standard. Any insights and links to more documentation on using this construct, which is new to me, would be appreciated.
I have a query which filters records containing uppercase and lowercase, i.e. Smith and SMITH, Henderson and HENDERSON etc. Is there a way that I can filter only those records that contain the first uppercase letter and the remaining lowercase letters for my query, i.e. Smith, HENDERSON etc.? Thanks
Hi All, how can I remove case sensitivity from a database for things like table names, column names etc.? Typing either "select * from AUTHORS" or "select * from authors" should return the same result. Abdul
;WITH ctePreAgg AS
(
    select top 500
        act_reference "ActivityRef",
        row_number() over (partition by act_reference order by act_reference) as rowno,
        t3.s_initials "Initials"
    from mytablestuff
    order by act_reference
[code]...
But what I would love to do next is take each of the above rows and return the initials either in one column, with all the nulls and duplicate values removed, separated by commas,
OR the above but using a variable number of columns based on the maximum number of different initials for each row. This is not strictly required, but may be neater for further work on the view.
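The single-column variant (distinct, non-null initials joined by commas) can be sketched with an aggregate string concatenation. The example below uses SQLite's GROUP_CONCAT through Python purely as a stand-in; the table Activity and its rows are invented. On SQL Server the usual counterparts are STRING_AGG (2017 and later) or the FOR XML PATH idiom, neither of which is tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activity (ActivityRef TEXT, Initials TEXT)")
conn.executemany(
    "INSERT INTO Activity VALUES (?, ?)",
    [("A1", "JS"), ("A1", "JS"), ("A1", None), ("A1", "MK"),
     ("A2", None), ("A2", "TD")],
)

# DISTINCT drops repeated initials; the aggregate skips NULLs on its own.
rows = conn.execute("""
    SELECT ActivityRef, GROUP_CONCAT(DISTINCT Initials) AS InitialsList
    FROM Activity
    GROUP BY ActivityRef
    ORDER BY ActivityRef
""").fetchall()

# The order inside each list is not guaranteed, so compare as sets.
initials_by_ref = {ref: set(joined.split(",")) for ref, joined in rows}
print(initials_by_ref)
```

The variable-number-of-columns variant would need a dynamic pivot and is not covered by this sketch.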