Remove Duplicates Within Pipeline
Sep 27, 2006
I have a situation where we get XML files sent daily that need uploading into SQL Server tables, but the source system producing these files sometimes generates duplicate records in the file. The tricky part is, that the record isn't entirely duplicated. What I mean, is that if I look for duplicates by grouping the key columns, having count(*) > 1, I find which ones are duplicates, but when I inspect the data on these duplicates, the other details in the remaining columns may differ. So our rule is: pick the first record, toss the rest of the duplicates.
Because we don't sort on any columns during the import, the first record kept of the duplicates is arbitrary. Again, we can't tell at this point which of the duplicated records is more correct. Someday down the road, we will do this research.
Now, I need to know the most efficient way to accomplish this in SSIS. If it makes it easier, I could just discard all the duplicates, since the number of them is so small.
If the source were a relational table, I could use a SQL statement to filter the records to remove the duplicates, but since the source is an XML file, I don't know how to filter these out in the pipeline, since the file has to be aggregated to search for dups.
Thanks
Kory
View 5 Replies
ADVERTISEMENT
Jan 9, 2008
I have a query which gives the following output, How can i get a output like this:
QUERY
COL1COL2COL3
A1AAGG
A1BBHH
A1CCJJ
B1DDKK
B1EELL
B1FFMM
OUTPUT
COL1COL2COL3
A1AAGG
BBHH
CCJJ
B1DDKK
EELL
FFMM
View 5 Replies
View Related
Jan 29, 2008
Hi All
I have the dbo.OperatingHour It has many duplicates and I want to remove duplicates permanently
The statement below works but when I open the table there are no changes
Insert into OperatingHour(Weekdays, Wednesdays, Fridays,Saturdays, [Sundays/Public Holidays])
(SELECT DISTINCT Weekdays, Wednesdays, Fridays,Saturdays, [Sundays/Public Holidays] FROM OperatingHour)
View 2 Replies
View Related
May 24, 2007
Welcome,how can I alter following table in order to reduce neighbouringduplicates (symbol, position, quantity, price).Nr Symbol Position QuantityPrice Date1. wz9999b 1 1.02500.0 2007-05-09 08:09:42.6532. wz9999b 2 12.02500.0 2007-05-09 08:09:42.6533. wz9999b 1 100.02590.0 2007-05-10 15:47:04.1404. PZ0008VX 1 2280.8842090.55000000000022007-05-1612:43:12.4035. PZ0008VX 1 2280.8842102.05000000000022007-05-1612:45:27.4206. wz9999b 1 0.0012500.0 2007-05-18 09:47:16.0337. wz9999b 1 0.0012500.0 2007-05-18 09:47:53.2708. wz9999b 1 1.01.0 2007-05-22 12:35:07.8939. PZ0008VX 1 2280.8842102.05000000000022007-05-2409:38:26.16010. PZ0008VX 1 2280.8842102.05000000000022007-05-2409:38:38.80011. wz9999b 1 0.001 2500.02007-05-24 12:35:07.20712 wz9999b 1 0.002 2500.02007-05-24 12:35:14.98713. wz9999b 1 0.001 2500.02007-05-24 12:38:07.207In the result set I would like to get the rows number 6 and 10.Any suggestions??
View 2 Replies
View Related
Oct 2, 2006
DELETE
FROM tblContacts
WHERE tblContacts.ID IN(
SELECT F.ID
FROM tblContacts AS F
WHERE Exists (
SELECT email, Count(ID)
FROM tblContacts
WHERE tblContacts.email = F.email
GROUP BY tblContacts.email
HAVING Count(tblContacts.ID) > 1
)
)
AND tblContacts.ID NOT IN(
SELECT Min(ID)
FROM tblContacts AS F
WHERE Exists (
SELECT email, Count(ID)
FROM tblContacts
WHERE tblContacts.email = F.email
GROUP BY tblContacts.email
HAVING Count(tblContacts.ID) > 1
)
GROUP BY email
)
I readily admit that I've shamelessly copied 'n pasted this from a tutorial and then taken a stab at tweaking it for my own ends. But I really don't understand what it's doing.
Really, all I want to know is that it will remove records with duplicate email fields. But I could also do with confirming - looking at the "SELECT Min(ID)" bit - does that mean that if it finds a duplicate, it'll delete the latest-added one? And if so, that changing it to remove the earliest-added one is simply a case of changing MIN to MAX?
Thanks :)
View 11 Replies
View Related
Dec 3, 2006
If we want to remove the duplicate row and leave only one row instead of 2 or 3 rows for example with the same column values.
2/ The same question but when all the columns of the row are duplicate except the id field.
Thanks a lot.
View 3 Replies
View Related
Oct 6, 2015
I am working with a bunch of records that have duplicates on the Persid and the intPercentID where there are duplicates I want to remove when I stick them in the temp table, I tried join on tempo table and doing not exists but still inserts, so now I am trying a merge but same thing. how can I keep duplicates from being inserted in the temp table. I made a cursor as well but its slow as heck, but it does work. trying better ways.
Create table #TempStr (STRId int not null Identity(1,1) primary key, Persid int, percentId int, dtCreated datetime, CreatedBy int)
Create table #NewStr (STRId int, Persid int, percentId int, dtCreated datetime, CreatedBy int)
INSERT #TempStr (Persid, percentId, dtCreated, CreatedBy)
select intPersonnelID, intPercentID, dtSubmitted, intSubmittedBy from tblSTR
whereintpercentId in (61,62) group by intPercentID, intPersonnelID, dtSubmitted, intSubmittedBy
UNION ALL
[code]....
View 3 Replies
View Related
Sep 1, 2015
I have table with columns as ID, DupeID1, DupeID2. ID column is unique. DupeID1 and DupeID2 -- the combination should only be there once. I don't want reverse combination of duplicates, i.e. DupeID2, DupeID1 in the table. How can I delete the reverse duplicates from this table?
View 10 Replies
View Related
Jul 13, 2015
I have 2 tables below:
Table 1:
Product No Quantity
A 1
B 2
C 3
Table 2:
Product No Grade Quantity
A Good
A Normal
A Bad
B Good
B Bad
C Good
C Normal
C Bad
In Table 2, Product No divided by Grade. I want to lookup the Quantity from Table 1 to Table 2. The same Product No will have 1 value, the other value is 0. The result for Column Quantity should be like this:
Table 2:
Product No Grade Quantity
A Good 1
A Normal 0
A Bad 0
B Good 2
B Bad 0
C Good 3
C Normal 0
C Bad 0
View 8 Replies
View Related
Jan 22, 2015
I have a table containing the following data:
LinkingIDID1 ID2
166202180659253178
166202253178180659
166334180380253179
166334253179180380
166342180380180659
166342180659180380
166582253179258643
166582258643253179
264052258642258643
264052258643258642
264502258643258663
264502258643259562
Within the LinkingID, there are duplicates in ID1 and ID2 but just in opposite columns. I have been trying to figure out a way to remove these set based. It doesn't matter which duplicate is removed. Essentially these are just endpoints and I don't care which side they are on. The solution must recognize the duplicates and not just remove based on every 2nd row.
View 8 Replies
View Related
Aug 11, 2015
I have a bunch of contacts that I've scored how well their names match to other contacts in the same business. I can programmatically figure out how to parse the results, but would like to know how to do this via SQL. My problem is for Business_fk 968976 I have 7 contacts. In the end I should have 4 contacts based on name match. For the business key listed Gerardo Lopez is in the ContactScore table twice for Contact keys 7355719 and 57028145. I then have two rows like so:
PossibleBusinessContactMatch_pk BusinessContact_fk Business_fk BusinessContactMatch_fk MatchTypeCode MatchScore MatchRank FirstName LastName Phone Email
------------------------------- ------------------ ----------- ----------------------- ------------- ----------- ----------- -------------------------------------------------- -------------------------------------------------- ---------- --------------------------------------------------------------------------------------------------------------------------------
1772960 57028145 968976 7355719 C 46 1 GERARDO I LOPEZ 8162214000
838834 7355719 968976 57028145 C 50 1 GERARDO
Each reference each other, and 2 is a good case, a more difficult case would have key 1 listed 10 times showing a ContactMatch_fk of 2 - 11, and then Contact_fk 2 listed 10 times with a ContactMatch_fk of 1, 3-11.I know 57028145 maps to 7355719 from the first row in the ContactScore table, so when Contact_fk of 7355719 comes up I should be able to skip it and not process that match. Hopefully that makes sense. Anyway here is the test data:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ContactScore]') AND type in (N'U'))
DROP TABLE [dbo].[ContactScore];
GO
CREATE TABLE [dbo].[ContactScore]
(
[ContactScore_pk]INT NOT NULL,
[Contact_fk]INT NOT NULL,
[code]..
View 9 Replies
View Related
Apr 23, 2015
How can i perform this task with ssis OR TRANSACT SQL? I HAVE THESE ROWS WITH THE NEXT DATA, I want to take just the valid one, BUT I HAVE A LOT OF COMBINATIONS AS following names, it can be animals, things or personal names
GABRIEL OBANDO --CORRECT
GABRIEL OVANDO
Gavriel OVANDO
gAbriel OBANDO
GABRIE OBANDO
Gabri OBONDA
MANAGUA --CORRECT
NANAGUA
NAMAGUA
View 5 Replies
View Related
Oct 15, 2006
Im working through the MS example of "removeDuplicates". I cant seem to figure out how to add custom property for input column.
I added the helper method:
private static void AddIsKeyCustomPropertyToInput(IDTSInput90 input, object value)
{
IDTSCustomProperty90 isKey = input.CustomPropertyCollection.New();
isKey.Name = "IsKey";
isKey.Value = value;
}
I call it from:
public override void ProvideComponentProperties()
{
//...
AddIsKeyCustomPropertyToInput(input, false);
//...
}
public override void ReinitializeMetaData()
{
IDTSInput90 input = ComponentMetaData.InputCollection[0];
if (input.CustomPropertyCollection.Count == 0)
{
AddIsKeyCustomPropertyToInput(input, false);
}
// ...
}
However when I deployed it and added the component to SSIS package - I cant see the Custom Column "IsKey" in the input column properties window.
What am I missing - please help
View 3 Replies
View Related
Sep 9, 2004
Hello All,
We all were new at one point.... any help is appreciated.
Objective:
Combining two 49,000 row tables and remove records where there is only 1 column difference. (keeping the specified column value removing the one with a blank.)
Reason:
I have 2 people going through a list, coding a specific column with a single letter value. They both have different progress on each sheet. Hence I am trying to UNION them and have a result of their combined efforts without duplicates.
My progress/where I'm stuck:
Here is my first query/union:
SELECT * FROM [Eds table]
UNION SELECT * FROM [Vickis table];
As shown above, I have unioned these 2 tables and my results removed th obvious whole record duplicates, but since 1 column is different on these, a union without criteria considers them unique.....
an example of duplicates that I must remove are as follows:
142301 - Product 5000 - 150# - S (Keep)
142031 - Product 5000 - 150# - "" <--- Blank (Remove)
I am trying to run another query on my first query results so I don't mess my first query up. Here it is:
SELECT DISTINCT [Prod #], [Prod Name], [Prod Description], [Product Type]
FROM [Combined Tables]
WHERE [Product Type]<>" ";
Please Help! Thank you in advance.
--------------------
5 minutes away from pulling my last one!
BaldNAskewed
View 7 Replies
View Related
Jun 23, 2015
I had Excel file input & import to DB Table by using Data flow in SSIS.but it had duplicates so I dont use the Dupe Records
So I planned like below:
Method 1:
Here OLEDB Destination are Good Records(Without Duplicates)
OLEDB Destination are Not Good Records(only Duplicates)
or
Method :2
If I add a column(GOOD_RECORD) in DB Table and Should I update '1' for top 1 record (for Good Record) and remaining as '0' for other Records (for Dups)latter I utilize Through flag of GOOD_RECORD
i.e.,, select * from DB_TABLE where GOOD_RECORD='1' .
I think that Method :2 Advisable for Performance/flexible but Here How can I update by using SSIS(Data flow) ????
View 4 Replies
View Related
Jun 22, 2015
I have some duplicate values for my query results, about 200 duplicates out of 30000 rows. Of these 200 duplicates I want to keep the ones that have a higher value for... 'UpdatedBatchID'.
SELECT
IR.Id as 'ID'
, CAST(IR.Priority as varchar) as 'Priority'
, IRSupportGroupDN.DisplayName as 'Support Group'
, DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.CreatedDate) as 'Created Date'
, DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),IR.ResolvedDate) as 'Resolved Date'
, SLOConfig.DisplayName as 'SLO'
, DATEADD(MI,DATEDIFF(mi,GETUTCDATE(),GETDATE()),SLOFact.TargetEndDate) as 'SLO Target'
, SLOStatusDN.DisplayName as 'SLO Status'
, SLOMetric.DisplayName as 'SLO Metric'
, SLOFact.UpdatedBatchId as 'UpdatedBatchID'
View 11 Replies
View Related
Feb 29, 2008
What happens when you add the Ignore Case flag into the mix?
I'm having a hell of a time - I'm dealing with an SCD situation using TableDifference component and I have both existing dimensions and new data coming in, each go through identical Case-Insensitive/Sort with remove duplicates, but I'm getting identical new and deleted records detected - I think because of ordering issues. I'm still trying to whittle the test case down, but I think data from all around the records I'm investigating seems to get sorted in between them, so I'm having trouble getting a small test case built.
I think the mixed case data is the root of the problem, and I think the design is bad, but before I go back to the technical lead, I need to understand enough to show that you cannot take two pipelines sorted and de-duped case-insensitively and then do a case-sensitive table difference operation.
View 4 Replies
View Related
Aug 5, 2014
I managed to transpose rows into columns.
;WITH
ctePreAgg AS
(
select top 500 act_reference "ActivityRef",
row_number() over (partition by act_reference order by act_reference) as rowno,
t3.s_initials "Initials"
from mytablestuff
order by act_reference
[code]...
But what I would love to do next is take each of the above rows - and return the initials either in one column with all the nulls and duplicate values removed, separated by a comma ..
ref, initials
Ag-4xYS
Ag-6xYS,BL
Ap-1xKW
At-2x SAS,CW
At-3x SAS,CW
OR the above but using variable number of columns based on the maximum number of different initials for each row.this is not strictly required, but maybe neater for further work on the view
ref, init1,init2
Ag-4xYS
Ag-6xYS,BL
Ap-1xKW
At-2x SAS,CW
At-3x SAS,CW
View 6 Replies
View Related
Apr 6, 2007
The scenario is as follows: I have a source with many rows. Each row has a column called max_qty_value. I need to perform a calculation using another column called qty. This calculation is something similar to dividing qty/(ceiling) max_qty_value. Once I have that number I need to write an additional duplicate row for each value from the prior calculation performed. For example, 15/4 = 4. I need to write 4 rows to the same target table as in line information for a purchase order.
The multicast transform appears to only support fixed and/or predetermined outputs. How do I design this logic in SSIS to write out dynamic number of rows to a target table.
Any ideas would be greatly appreciated.
thanks
John
View 18 Replies
View Related
Oct 22, 2014
I have a table with 22 million Business records. I can see that there are duplicates when I group by BusinessName and Address and Phone. I'd like to place only the duplicates into a table, with a ranking, oldest business key gets a ranking of 1.
As a bonus I'd like each group to have a distinct group name (although not necessary, just want to know how to do this)
Later after I run more verifications to make sure these are not referenced elsewhere I'll delete everything with a matchRank > 1 out of the main Business table.
DROP TABLE [dbo].[TestBusiness];
GO
CREATE TABLE [dbo].[TestBusiness](
[Business_pk] INT IDENTITY(1,1) NOT NULL,
[BusinessName] VARCHAR (200) NOT NULL,
[Address] VARCHAR(MAX) NOT NULL,
[code]....
View 9 Replies
View Related
Jan 26, 2015
Is there a query or a way to convert duplicates value in a column to non duplicates.
View 14 Replies
View Related
Apr 20, 2006
Hi,
I want to incorporate this code but I dont know how to import Microsoft.SqlServer.Dts.Pipeline in an Integration Services Project template. I was thinking of putting this code in the script task but still, I cant import Pipeline. Add reference list does not have it as well. Please let me know how to incorporate this code. Thanks!
Code:
if (ComponentMetaData.RuntimeConnectionCollection["SourceFileConnection"].ConnectionManager != null)
{
cm = DtsConvert.ToConnectionManager(ComponentMetaData.RuntimeConnectionCollection["SourceFileConnection"].ConnectionManager);
if (cm.CreationName == "FILE")
{
fileUsage = (Microsoft.SqlServer.Dts.Runtime.DTSFileConnectionUsageType)cm.Properties["FileUsageType"].GetValue(cm);
if (fileUsage == Microsoft.SqlServer.Dts.Runtime.DTSFileConnectionUsageType.FileExists)
{
connectionString = ComponentMetaData.RuntimeConnectionCollection["SourceFileConnection"].ConnectionManager.AcquireConnection(transaction).ToString();
if (connectionString == null || connectionString.Length == 0)
{
throw new Exception("No file name specfiy");
}
}
else throw new Exception("Incorrect file connection usage type, should be set to exiting file type");
}
else throw new Exception("Connection is not a file connection");
}
else throw new Exception("Connection is not as assign");
View 1 Replies
View Related
Nov 30, 2006
Hi all,
I tried to remove AdventureWorksDB in the "Add or Remove Programs" of Contol Panel and I got the following errors: (1) AdventureWorksDB Error 1326: Error getting file security: CProgram FilesMicrosoft SQL ServerMSSQL1MSSQLGetLastError: 5. |OK| and (2) Add or Remove Programs Fatal Error during installation (after I clicked the |OK| button). Please help and tell me how I can solve this problem.
Thanks in advance,
Scott Chang
View 1 Replies
View Related
Oct 27, 2006
This is probably obvious, but how do I split a pipeline. I.e. I've got a data source with 200 columns - I need to split this into 20 pipelines each containing 10 of the original columns.
View 7 Replies
View Related
Oct 12, 2006
I have uninstalled the CTP version of the SQL Server express so that I can install the released version but CTP version is still listed in the add/remove program list but without the change/remove button. I have been to different sites to find information on cleaning this up and I have ran all the uninstall tool I can find but the problem still prevails. I cannot install the released version without completely getting rid of the CTP version. Please help anyone.
Thanks
deebeez1
View 1 Replies
View Related
Nov 27, 2006
We have deployed an SSIS package successfully to production. We needed to apply SP1 to fix a different issue and now have encountered a new problem. We have numerous Data Reader Sources in different Data Flow Tasks that connect to a IBM iSeries (DB2) source. Pretty simple extracts that have worked fine in the past. They pump the data into staging tables on the SQL2K5 instance running the package (64-bit).
After we applied SP1 however, all of the Data Reader tasks fail AFTER they successfully copy the records with the following error.
[iSeries Invoice Details [1]] Error: System.NullReferenceException: Object reference not set to an instance of an object. at Microsoft.SqlServer.Dts.Pipeline.DataReaderSourceAdapter.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers) at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput(IDTSManagedComponentWrapper90 wrapper, Int32 outputs, Int32[] outputIDs, IDTSBuffer90[] buffers, IntPtr ppBufferWirePacket)
If I delete the source and destination and recreate identical transforms, they work fine, but I don't feel like rebuilding all of the extracts. Any ideas! The problem occurs in all environments that we've tried.
TIA,
Michael Shugarman
P.S. I just tried the SP2 CTP, but that doesn't fix the problem.
View 2 Replies
View Related
Jul 10, 2007
Hi I have created a simple SSIS project on my client that carries out 4 Data Flow tasks, each one copying a few hundred rows from an Oracle 10.0.2 database. This works OK and will also run in debug mode fine.
I have copied the package to the file system on our development server and get the following error when in debug mode:-
[DTS.Pipeline] Information: Validation phase is beginning.
Progress: Validating - 0 percent complete
[OLE DB Source [1]] Error: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "Server.user" failed with error code 0xC0202009. There may be error messages posted before this with more information on why the AcquireConnection method call failed.
[DTS.Pipeline] Error: component "OLE DB Source" (1) failed validation and returned error code 0xC020801C.
Progress: Validating - 50 percent complete
[DTS.Pipeline] Error: One or more component failed validation.
Error: There were errors during task validation.
Validation is completed
[Connection manager "Server.user"] Error: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft OLE DB Provider for Oracle" Hresult: 0x80004005 Description: "Error while trying to retrieve text for error ORA-01019 ".
Validation is completed
If you go to the source of each flow task and select preview you can retreive the data.
Thanks Paul
View 1 Replies
View Related
Jul 24, 2007
Is there any way I can capture the below information? I want to capture this to get the no of rows processed by each transformation.
[DTS.Pipeline] Information: "component "abc" (3798)" wrote 2142 rows.
[DTS.Pipeline] Information: "component "xyz" (4223)" wrote 1026 rows.
[DTS.Pipeline] Information: "component "abc2" (4324)" wrote 7875 rows.
Thanks
View 7 Replies
View Related
Apr 24, 2008
Hi
I have an existing application that programmatically builds SSIS 2005 packages.
I'm trying to get to working with the February CTP of SQL Server 2008. Having changed all the 2005 references to 2008 references and things like IDTSComponentMetaData90 to IDTSComponentMetaData100, my application compiles okay now, but hits a problem when it tries to create a Data Flow task.
The code which worked fine before (and seems to still be the recommended way in Books Online is):
Code Snippet
Dts.TaskHost myMainPipe = (Dts.TaskHost)container.Add("DTS.Pipeline.1");
However, this now produces the exception:
Cannot create a task with the name "DTS.Pipeline.1". Verify that the name is correct.
Should I be using a different moniker now? I took a stab at "DTS.Pipeline.2", but that didn't make a difference.
Thanks,
Andrew
View 10 Replies
View Related
Dec 20, 2006
Hello,
I'm wish to receive pipeline events fired by a SSIS package.
I execute the package successufully with the following code (c#):
MyEventListener eventListener = new XplorerEventListener();
DtsApplication app = new DtsApplication();
Package pkg = app.LoadPackage("c: est.dstx", null);
pkg.Execute(null, null, eventListener, null, null);
MyEventListener is inherited from DefaultEvents, overriding all OnXXX methods.
It works perfectly, however I cannot intercept the following events:
- PipelineExecutionTrees
- PipelineExecutionPlan
- PipelineExecutionInitialization
- BufferSizeTuning
- PipelineInitialization
Anyone knows how to catch those pipeline events?
TIA,
Paolo.
View 1 Replies
View Related
Jan 17, 2007
Alot of people complain, legitamately, that they wish to remove columns from the SSIS pipeline that they know are not going to be used again. This would help to avoid the "clutter" that can exist when there are alot of columns in the pipeline.
If you are one of those people then click-through below, vote and (most importantly) add a comment. The more people that do that - the more likely we are to get this functionality in a future version.
SSIS: Hide columns in the pipeline
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=252462
-Jamie
View 4 Replies
View Related
Nov 19, 2007
Hi, My package hangs and the log says DTS.Pipeline: Validation phase is beginning. Any ideas why this is happennig? This same package runs fine when I run it without turning on the transaction.
View 4 Replies
View Related
Jun 12, 2007
I have several stage to star (i.e. moving data from a staging table through the key lookups into a fact table) ETL transformations in a single SSIS package. Each fact table has a different set of measures but the identical foreign key set, e.g. ConsultantKey, SubsidiaryKey, ContestKey, ContestParamKey and MonthKey.
Currently I have to replicate the key lookup (Surrogate Key Pipeline, or SKP) for each data flow. If I could cache each dimension one time in the package and reuse it for each stage to fact it would be much more efficient.
Is there a way for me to reuse a common data flow?
View 6 Replies
View Related