Integration Services :: Merge Inner Join Gives Different Output Based On Sort Key?
Sep 23, 2015
In the first image, as can be seen, I have 2 different data sources which are then joined using a Merge Join (inner join). The "Sort" is on the BusinessEntityID column of the Person table and "Sort1" is on the PersonID column of the Customer table. The merge join of these 2 results in 19,119 rows.
On the other hand, in the second image I use a single data source with a query that inner-joins the same tables (i.e. the 2 tables that were in 2 different data sources in the first image). Also, since Merge Join cannot operate without a sort key, I have defined TerritoryID as the sort key in the Advanced Editor. The number of rows I get this way is 10,274. My SELECT query was:
SELECT
P.BusinessEntityID,
P.PersonType,
P.Title,
P.FirstName,
P.MiddleName,
P.LastName,
P.Suffix,
C.TerritoryID
FROM stg.Person AS P
INNER JOIN stg.Customer AS C ON C.CustomerID = P.BusinessEntityID
ORDER BY C.TerritoryID;
In my view, the results should have been the same, as in the first case I am using a Merge Join (inner join) and in the second case I am using a SELECT query with an inner join. Upon drilling down I found that in the first case my sort keys are BusinessEntityID and PersonID; if I modify this to CustomerID and BusinessEntityID, as this is my join condition (in the inner join query shown above), I get the desired output. What I was wondering was: how does the sort order change the join condition?
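A note on why the keys matter: Merge Join assumes each input arrives ordered on the join columns and walks the two sorted streams in step, so declaring TerritoryID (or any non-join column) as the sort key silently breaks the match logic rather than raising an error. A minimal sketch of source queries that satisfy that contract, using the tables from the post:

-- Source 1: sorted on the column it joins on
SELECT BusinessEntityID, PersonType, Title, FirstName, MiddleName, LastName, Suffix
FROM stg.Person
ORDER BY BusinessEntityID;

-- Source 2: sorted on its join column
SELECT CustomerID, TerritoryID
FROM stg.Customer
ORDER BY CustomerID;

With IsSorted and the SortKeyPosition values pointing at these same columns, both pipelines should produce the 19,119-row result.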
I am using SSIS in SQL Server Enterprise 2005. I have two OLE DB data sources from two disparate databases (IBM DB2 and Microsoft SQL Server), some columns from each of which are to be included in the merged output results. I have noted the various requirements in the forum postings with regard to sorting the OLE DB sources and specifying the output source columns as being sorted, as well as the requirement that the join fields in the two sources be close/exact matches. Yet, when I run this in VS, while the work area reflects the expected number of rows being input into the Merge Join transformation, no count is reflected as output from that transformation into the final destination table.

Specifically, my two data sources (IBM DB2 and MS SQL) are configured as follows:
The IBM DB2 source contains an SQL statement that uses CAST operations to create the result columns, and an ORDER BY clause to ensure that the output is sorted by the desired two columns. The OLE DB source property IsSorted is set to True; in the Output Columns folder, the column definitions for "key_source_dtsy" and "key_source_dtrt" have their SortKeyPosition properties set to 1 and 2, respectively. Those fields are both defined as data type DT_STR, with lengths of 4 and 2, respectively. Below is the path metadata from the Data Flow Path editor for the path from this source:
IBM DB2 source:

Name             Data Type  Precision  Scale  Length  Code Page  Sort Key Position  Comparison Flags  Source Component
ID_CODE          DT_STR     0          0      10      1252       0                                    Source F0005 User Defined Codes
CODE_DESCR_1     DT_STR     0          0      30      1252       0                                    Source F0005 User Defined Codes
CODE_DESCR_2     DT_STR     0          0      30      1252       0                                    Source F0005 User Defined Codes
key_source_dtsy  DT_STR     0          0      4       1252       1                                    Source F0005 User Defined Codes
key_source_dtrt  DT_STR     0          0      2       1252       2                                    Source F0005 User Defined Codes
The MS SQL source contains an SQL statement that takes the columns as they are in the MS SQL table (no CAST operations needed); it also uses an ORDER BY clause to ensure the output is sorted by the join columns. The OLE DB source property IsSorted is set to True; in the Output Columns folder, the columns "key_source_dtsy" and "key_source_dtrt" have their SortKeyPosition properties set to 1 and 2, respectively. Those fields are both defined as data type DT_STR, with lengths of 4 and 2, respectively. Below is the path metadata from the Data Flow Path editor for the path from this source:
MS SQL source:

Name             Data Type  Precision  Scale  Length  Code Page  Sort Key Position  Comparison Flags  Source Component
id_code_name     DT_I2      0          0      0       0          0                                    Source CodeName in db dwVdFY
key_source_dtsy  DT_STR     0          0      4       1252       1                                    Source CodeName in db dwVdFY
key_source_dtrt  DT_STR     0          0      2       1252       2                                    Source CodeName in db dwVdFY
The Merge Join transformation specifies an INNER JOIN using the columns named "key_source_dtsy" and "key_source_dtrt" from the respective data sources. I know there are alternative ways of accomplishing my intent (Lookup, porting the MS SQL table to IBM DB2 so the join can occur in a SELECT statement, etc.); however, I'd like to use this functionality and assume that it should work.
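One hedged guess at the zero-row output, given the setup above: Merge Join trusts the IsSorted flag, and DB2 and SQL Server do not necessarily order DT_STR values by the same rules (EBCDIC-derived versus Windows collations), so both streams can be "sorted" and still never line up. Forcing the SQL Server side to an explicit binary sort is one way to pin the order down; the collation name and table name here are assumptions to adapt:

SELECT id_code_name, key_source_dtsy, key_source_dtrt
FROM dbo.CodeName   -- hypothetical table name, per the path metadata above
ORDER BY key_source_dtsy COLLATE Latin1_General_BIN,
         key_source_dtrt COLLATE Latin1_General_BIN;

The DB2 query would then need an equivalent cast/order so both sides agree on the same byte-level ordering.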
How do I pass a single column of values from a successful merge join to an EXECUTE SQL statement so it can be used with an "IN" criteria of the WHERE clause? Here's an example of my update statement with two random key values:
UPDATE dbo.MyTable SET MyStatus = 1 WHERE MyPK IN ('XYZ123', 'DEF890')
Is this even possible in SSIS, or am I better off using a loop and running the update EXECUTE SQL Statement for each individual key value, as in the following example?
UPDATE dbo.MyTable SET MyStatus = 1 WHERE MyPK = 'XYZ123'
UPDATE dbo.MyTable SET MyStatus = 1 WHERE MyPK = 'DEF890'
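The Execute SQL Task lives in the control flow and cannot consume data-flow rows directly, so building the IN list takes extra work either way. A common alternative is set-based: land the merge join output in a staging table with an OLE DB Destination, then run one joined UPDATE. A sketch (dbo.KeysToUpdate is a hypothetical staging table):

UPDATE t
SET    t.MyStatus = 1
FROM   dbo.MyTable AS t
INNER JOIN dbo.KeysToUpdate AS k   -- hypothetical staging table loaded from the merge join
       ON k.MyPK = t.MyPK;

That avoids both string-building for the IN clause and a per-row loop.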
When I select this data with an ORDER BY, like:

select test from table_2 order by test

the result will be:

Code Snippet
test
1-1.00.00.00
1000
2000

If you sort the data with the SORT block of SSIS instead, the result will be:

Code Snippet
test
1000
1-1.00.00.00
2000

This is annoying and dangerous, because it causes the next bug.
2/ Two data sources sorted by an ORDER BY clause can cause problems in a Merge Join.
If you have 2 data sources, both correctly sorted by an ORDER BY in the query, and you join them with a Merge Join, you can lose some records in the data flow. This happens with larger datasets than the examples above.
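A quick way to check whether this is the string-comparison mismatch described above (a sketch; assumes test is a varchar column): compare the server's dictionary sort with a binary sort. If the two orders differ, a Merge Join that trusts the ORDER BY can silently drop rows.

SELECT test FROM table_2 ORDER BY test;                             -- server's dictionary (word) sort
SELECT test FROM table_2 ORDER BY test COLLATE Latin1_General_BIN;  -- bytewise sort, for comparison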
I need to run an INSERT query which pulls data from Table AAA in database AA on server A, conditional on (or joined with) Table BBB in database BB on server B. In SQL 2000 it could be done as:
From Server A:

sp_addlinkedserver B

INSERT dbo.ResultsTable
SELECT SourceTable.*
FROM B.BB.dbo.BBB SourceTable
INNER JOIN A.AA.dbo.AAA ConditionTable
    ON SourceTable.RecID = ConditionTable.RecID

sp_dropserver B
In SSIS one possible solution is a package which does the following: OPEN A + OPEN B -> SORT A + SORT B -> MERGE JOIN A and B -> OUTPUT RESULT
The problem with this approach is that it's extremely slow for large data files (50M records each).
Questions:
1) In the procedure above, could the SORT step be avoided?
2) Is there another approach to running a cross-server JOIN in SSIS?
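On question 1: the SORT step can usually be dropped by pushing the sort into each source query and then declaring it on the source output (Advanced Editor: IsSorted = True on the output, SortKeyPosition = 1 on RecID). A sketch of the two source queries; the column lists are placeholders:

-- Source on server B
SELECT RecID, Col1, Col2   -- hypothetical column list
FROM BB.dbo.BBB
ORDER BY RecID;

-- Source on server A
SELECT RecID, Col1, Col2   -- hypothetical column list
FROM AA.dbo.AAA
ORDER BY RecID;

Each database engine then does the sorting (where an index on RecID can help), instead of the pipeline buffering 50M rows per input.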
As bcp does not allow column names to be included, I have developed a method for providing the columns. The end result is that two tables are required for each output: a "ColumnNames" table and the table that contains the actual data. However, the bcp command is sorting the data; why is this happening?
According to Microsoft, by default bcp will not apply any sorting unless specified.
Here is the command I am using to perform the bcp output:
SET @bcpCommand =(select 'bcp "SELECT * FROM GPReports.dbo.MIS001_BCPColumnNames UNION SELECT * FROM GPReports.dbo.voltemp" queryout ' + @FilePath+' -c -t -T')
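A likely culprit, rather than bcp itself: UNION without ALL removes duplicates, and that de-duplication typically sorts the rows. UNION ALL avoids the sort; note that only an explicit ORDER BY (e.g. on an added sequencing column) truly guarantees that the header rows come first. A sketch of the changed command:

SET @bcpCommand = (select 'bcp "SELECT * FROM GPReports.dbo.MIS001_BCPColumnNames UNION ALL SELECT * FROM GPReports.dbo.voltemp" queryout ' + @FilePath + ' -c -t -T')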
I've run into something that looks like a bug to me but I wanted to run it by the board:
Merge join 2 sorted tables.
Table1: ColumnA: Sort Order 1, ColumnB: Sort Order 2
Table2: ColumnA: Sort Order 1, ColumnB: Sort Order 2, ColumnC: not sorted
Merge Join the two tables on ColumnA and ColumnB...
Choose the following as output columns
A + B + C = works
C = works
A + C = works
B + C = does NOT work; error message: The column with the SortKeyPosition value of 0 is not valid. It should be 2.
Basically, if you choose one or more of the sorted columns in the output, at least one of them has to be the column with sort position 1, or you'll get that error.
Is this a bug or intentional? If you do not have sort column 1 in the output, that output can no longer be considered sorted, so perhaps the error is related to that (although instead of an error I'd expect a warning about the sorting). Interestingly, it lets you choose C only, even though that also makes the output unsorted.
I have a problem with a Merge Join providing no output (when it should have 1890 rows). My Data Flow Task has 4 OLE DB data sources, 3 Multicasts, and 1 OLE DB destination. I am experiencing the problem near the end of my data flow, where two Multicasts create two parallel flows of data (see Level 1 below). I have two Merge Joins which join one leg from each Multicast with a leg from the other Multicast (see Level 2 below). Then the two remaining legs use a Merge to get my destination output (see Level 3 below).
I am experiencing my problem with the Merge Join (input A2, B2) --> (output C2) transformation. The Merge Join providing output C1 appropriately outputs 1890 rows, but C2 outputs 0 rows. Both Merge Joins are identical. The data is identically sorted prior to entering the problematic Merge Join, and a Data Viewer (Grid) verified that the data is entering appropriately. Merge Join (input A2, B2) --> (output C2) has 667 rows as input A2 and 1890 rows as input B2 (using an inner join, just like the other Merge Join), but C2 baffles me with 0 rows of output (when it too should have 1890). I receive no output errors, and the execution completes showing all green.
I read about mysterious behavior with Merge Joins and have attempted modifying my EngineThreads property to values between 2 and 10, with no luck. Any help/ideas would be appreciated.
I have an SQL statement that joins two tables, and I get back a few thousand records when I run it in the query tool in Management Studio.
But when I use an SSIS Merge Join to join the two tables, my output is 0 records.
I did sort the key column in both tables by setting the SortKeyPosition property to 1 in the Advanced Editor for the output of both tables.
However, the Merge Join returns nothing to my destination tables. Also, I am doing an inner join. The task runs without error but returns nothing as well. Any ideas?
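Worth checking here: SortKeyPosition only declares a sort, it does not perform one. Each source query must actually order the rows on the join column (and the output's IsSorted flag must be True), or the inner join can find zero matches. A sketch, with KeyCol standing in for the real join column:

SELECT KeyCol, Col1, Col2   -- hypothetical column list; KeyCol is the join column
FROM dbo.Table1
ORDER BY KeyCol;            -- repeat the pattern for the second source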
I'm doing a data conversion on one of my fields (SUMDWK) from one of the tables that will be used in a Merge Join. With the new, converted field, I do a Lookup. From this Lookup, I want to take a new field, FiscalWeekOfYear, and replace the original field, SUMDWK. This is necessary because SUMDWK is one of the sorted fields. In the Lookup, it is not possible to change the Output Alias. Does anybody know a way around this? Thanks.
I created a package that seems to work fine with a small amount of data. When I run the package with more data (as in production), however, the merge join output is limited to 9,963 rows, no matter whether I change the number of input rows.
Situation as follows.
The package has 2 OLE DB Sources, in which SQL-statements have been defined in order to retrieve the data.
The flow of source 1 is: retrieving source data -> trimming (non-key) columns -> sorting on the key-columns.
The flow of source 2 is: retrieving source data -> deriving 2 new columns -> aggregating the data to the level of source 1 -> sorting on the key columns.
Then both flows are merged and other steps are performed.
If I test with just a couple of rows, it works fine. But when I change the WHERE clause in the data source retrieval so that the number of rows is, for instance, 15,000 or 150,000, the number of rows after the merge join is 9,963.
When I run the package in debug mode the step is colored green; nevertheless, an error is displayed:
Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Merge Join" (4703) failed with error code 0xC0047020. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
To be honest, a few more error messages appear, but they don't seem related to this issue. The package stops running after some 6,000 rows have been written to the destination.
ID  Name  Date
1   A     null
2   B     01/01/2012
3   C     01/02/2013
Also, I have a sort parameter @sort and values are (Name, ID, Date)
I want to apply a page break whenever @sort = Name. There should be no page break when the user selects @sort = ID or Date; the break should happen only when the @sort value is Name.
It should be like this...

Page 1:
ID  Name  Date
1   A     null

Page 2:
ID  Name  Date
2   B     01/01/2012

Page 3:
ID  Name  Date
3   C     01/02/2013
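One hedged way to do this, assuming this is an SSRS report on 2008 R2 or later (where a group's page break exposes a Disabled property): group the tablix on ID so every detail row can break, enable the page break on that group, and set its Disabled property to an expression so the break only fires for the Name sort:

=IIF(Parameters!sort.Value = "Name", False, True)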
Does anyone know how I can go about merging preexisting PDF files and SQL Server Reporting Services output? Can this be done in Reporting Services? For example, I have 5 pages from a PDF file created by another third-party software provider, and I have output from SQL Reporting Services. How can I merge these two outputs and deliver the result over the .NET/ASP framework?
We've two OLE DB sources under a DFT. TableA from one OLE DB source brings IDs (1, 3, 5) and TableB from the other OLE DB source brings IDs (0, 3, 6). Would I be able to use the Merge component to get all non-matching IDs from both tables A and B and store them in the OLE DB destination as (0, 1, 5, 6) [1 and 5 from TableA, 0 and 6 from TableB]? If not, what other option do I have to make this requirement doable?
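The Merge component alone cannot do this; it is essentially a sorted UNION ALL. Inside the data flow the equivalent is a Merge Join set to Full outer join followed by a Conditional Split that keeps rows where either side's ID is null. In T-SQL the target set looks like this (a sketch, assuming both ID columns are NOT NULL):

SELECT COALESCE(a.ID, b.ID) AS ID
FROM TableA AS a
FULL OUTER JOIN TableB AS b
     ON b.ID = a.ID
WHERE a.ID IS NULL OR b.ID IS NULL;   -- keeps 0, 1, 5, 6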
Now, I want to merge the source data with the target table; that means, if a record is already available in the target then ignore it, and if it is not available then INSERT it.
This is the query I used, but new records are not getting inserted.
MERGE #target T
USING #source S
    ON S.SOURCE = T.Source
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Source, Prefix, tgt_patientcode, tgt_patientdesc)
    VALUES ('Canada', 'cn', s.patientcode, s.patientcode);
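A hedged reading of why nothing inserts: the ON clause compares S.SOURCE to T.Source, and once any target row carries that Source value, every source row counts as matched, so NOT MATCHED BY TARGET never fires. Matching on the row's business key fixes that; a sketch assuming patientcode identifies a row (and that the doubled patientcode in the VALUES list was meant to be patientdesc):

MERGE #target AS T
USING #source AS S
    ON S.patientcode = T.tgt_patientcode
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Source, Prefix, tgt_patientcode, tgt_patientdesc)
    VALUES ('Canada', 'cn', S.patientcode, S.patientdesc);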
I am trying to implement a Slowly Changing Dimension transformation using MERGE, meaning both changing and historic attributes are in place. It seems we can use UPDATE only once in a MERGE, but in our scenario we have to update in two cases: when a historic attribute has changed (to mark the old row as expired, IsCurrent = 0), and also when only a changing attribute has changed (the historic attributes are the same). I am using CDC to do this. The updated OUTPUT is moved to a temporary table, and an Execute SQL Task performs the updates.
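For the two-updates problem, one known workaround is composable DML (a sketch with a hypothetical dimension schema, not necessarily the poster's): the MERGE expires the current row, and its OUTPUT clause feeds the changed rows back out so an outer INSERT can add the new versions. Changing-attribute-only updates can be folded into the same single UPDATE with CASE expressions.

INSERT INTO dbo.DimPatient (PatientCode, PatientDesc, IsCurrent, ValidFrom)
SELECT PatientCode, PatientDesc, 1, GETDATE()
FROM (
    MERGE dbo.DimPatient AS tgt
    USING stg.Patient AS src
        ON src.PatientCode = tgt.PatientCode AND tgt.IsCurrent = 1
    WHEN MATCHED AND src.PatientDesc <> tgt.PatientDesc THEN
        UPDATE SET tgt.IsCurrent = 0, tgt.ValidTo = GETDATE()   -- expire the old version
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (PatientCode, PatientDesc, IsCurrent, ValidFrom)
        VALUES (src.PatientCode, src.PatientDesc, 1, GETDATE())
    OUTPUT $action AS MergeAction, src.PatientCode, src.PatientDesc
) AS m
WHERE m.MergeAction = 'UPDATE';   -- re-insert the expired rows as current versions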
I have a Pivot Transform in SSIS (2005) working perfectly, EXCEPT that the first column of the output (the date) repeats for each of the following columns, which themselves fall into the correct column, but not on the same line for a particular date as the others. A snippet of the result from the Data Viewer is:
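If this is the classic symptom, the cause is input order: the 2005 Pivot transform starts a new output row each time the set-key value (the date here) changes, so input that is not sorted on the date produces one sparse row per input row. Sorting the source on the set key usually collapses them; a sketch with hypothetical names:

SELECT ReportDate, MetricName, MetricValue
FROM dbo.SourceData      -- hypothetical source
ORDER BY ReportDate;     -- the Pivot's set key must arrive grouped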
Is it possible to get the output of an Execute SQL Task to an Excel destination? I have a query which compares the data difference between two databases. It compares all tables in both databases and lists out the difference in data by each table. I need to run this query using SSIS and get the output to an Excel sheet. I have used a Data Flow Task to run this query, but the query gives some error when used within the Data Flow Task. So I have used an Execute SQL Task and need to write the output to an Excel sheet.
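A hedged workaround for the data-flow error: multi-statement comparison scripts often break the OLE DB Source's metadata detection. Wrapping the logic in a stored procedure that suppresses rowcount messages and returns exactly one result set usually makes it usable as a source, which can then feed an Excel Destination directly. A skeleton under those assumptions (names are hypothetical):

CREATE PROCEDURE dbo.usp_CompareDatabases
AS
BEGIN
    SET NOCOUNT ON;   -- suppress "(n rows affected)" chatter that confuses the source
    SELECT TableName, KeyValue, ColumnName, DbAValue, DbBValue
    FROM dbo.ComparisonResults;   -- hypothetical table holding the computed differences
END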
I have a requirement to move files from a HOLD folder to an input folder. In the HOLD folder I receive multiple files starting with af, ai, and ar, i.e. af*.txt, ai*.txt, ar*.txt. I need to move one file at a time to the input folder, as each file is to be loaded into the database before the next file is processed. Across all the files, the SSIS package has to look at ai*.txt files first, followed by af*.txt and lastly ar*.txt. If there are multiple files of the same group, the file with the oldest date has to be moved first. How do I achieve this?
I am using the following script to check the existence of a table in the database and create it dynamically.
This works when the table does not exist; it errors when the table does exist.
I am using this script in the Execute SQL Task:
[Execute SQL Task] Error: Executing the query "declare @ODSDB varchar(50) declare @SQLSTMT varcha..." failed with the following error: "There is already an object named 'addressTable' in the database.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
declare @ODSDB varchar(50)
declare @SQLSTMT varchar(max)
set @ODSDB = 'SampleDB'
begin
    set @SQLSTMT = ' IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(''' + @ODSDB + '.dbo.addressTable'') and Type=''U'')
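A hedged diagnosis from the snippet: OBJECT_ID('SampleDB.dbo.addressTable') returns an id from SampleDB, but the check scans sys.objects of whatever database the task's connection currently points at, so the existence test can miss and the CREATE then hits the duplicate-name error. Testing OBJECT_ID directly sidesteps the mismatch (sketch; the column list is a placeholder):

declare @ODSDB varchar(50)
declare @SQLSTMT varchar(max)
set @ODSDB = 'SampleDB'
set @SQLSTMT = 'IF OBJECT_ID(''' + @ODSDB + '.dbo.addressTable'', ''U'') IS NULL '
             + 'CREATE TABLE ' + @ODSDB + '.dbo.addressTable (AddressID int NOT NULL)'  -- placeholder column list
exec (@SQLSTMT)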
I'm writing a custom source component that reads data from a SharePoint list with dynamic mapping to output columns. It's my first custom component, and it's based on several samples and tutorials from the Internet.
Output columns are not created by the component itself; they must be added by the user at design time. The component dynamically makes an association between SharePoint fields and the available output columns at run-time (based on a mapping table).
I made a very basic skeleton, and I encounter a problem when I add a column to the output: it has no data type, and when I try to set one I get the error: Property value is not valid. The component xxxxxx does not allow setting output column datatype properties.
Imports System
Imports Microsoft.SqlServer.Dts.Pipeline
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<DtsPipelineComponent(ComponentType:=ComponentType.SourceAdapter,
    DisplayName:="SharePoint Dynamic Assoc List Source",
I am downloading a webpage as a text file in order to read a specific string and assign it to a variable/parameter for creating an output file name. I would like to know how I would be able to look for a specific string and output it as another variable for the rest of the package.
2015 Conforming Loan Limits ------------------------------------------------------------------------ o _Loan Limits for Calendar Year 2015--All Counties _[XLS] </DataTools/Downloads/Documents/Conforming-Loan-Limits/FullCountyLoanLimitList2015_HERA-BASED_FINAL_FLAT.xlsx>_ , _[PDF] </DataTools/Downloads/Documents/Conforming-Loan-Limits/FullCountyLoanLimitList2015_HERA-BASED_FINAL.pdf>_ o _List of 46 Counties with Increases in Loan Limits for 2015
[Code] ...
To explain it in a better way: I have a sample webpage text here. I should be searching for "FullCountyLoanLimitList" appended with the current year (like FullCountyLoanLimitList2015), copy the entire file name from the text file, and assign it to another variable so that I can download that specific file using a WebClient connection.
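If the downloaded text is staged into a table one line per row, the file name can be cut out with string functions; a sketch with a hypothetical staging table stg.PageLines(LineText varchar(max)):

DECLARE @pattern varchar(100);
SET @pattern = 'FullCountyLoanLimitList' + CAST(YEAR(GETDATE()) AS varchar(4));

SELECT SUBSTRING(LineText,
                 CHARINDEX(@pattern, LineText),
                 CHARINDEX('.xlsx', LineText) + 5 - CHARINDEX(@pattern, LineText)) AS FileName
FROM stg.PageLines
WHERE LineText LIKE '%' + @pattern + '%.xlsx%';

The single-row result can then be mapped to an SSIS variable and handed to the download step.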
We are building a data load application where parameters are stored in a table, and there are multiple packages for each load. There is a column IsChecked; only when it is 1 should the child package execute. I created a master package in which an Execute SQL Task stores the result in a variable, and based on that result the child package should execute. In the Execute SQL Task I selected the result set as Full result set. I am getting the below error.
[Execute SQL Task] Error: Executing the query "SELECT isnull(ID ,0) AS ID FROM DataLoadParameter..." failed with the following error: "The type of the value (DBNull) being assigned to variable "User::LoadValue" differs from the current variable type (Int32). Variables may not change type during execution. Variable types are strict, except for variables of type Object.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
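A hedged guess at the mismatch: with ResultSet set to Full result set, the receiving variable must be of type Object; an Int32 variable like User::LoadValue only works with Single row, and then the query must always return exactly one non-NULL value. An aggregate guarantees that (a sketch; the table name is truncated in the error text, so adjust it):

SELECT ISNULL(MAX(ID), 0) AS ID
FROM DataLoadParameter   -- full name truncated in the error message
WHERE IsChecked = 1;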
I have implemented a package to load multiple files to a destination. Since the source was a txt file, I created a Flat File source. However, now we are getting files in Excel format as well.
Is there any way the source can be changed dynamically based on the file extension (the output of the Foreach File enumerator)? One solution I can think of is to have 2 data flow tasks gated by precedence constraints with expressions, one for .txt and the other for .xls; see the expression sketch below.
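The two-data-flow idea works; the precedence constraints just need expressions on the file name. A sketch assuming the Foreach enumerator writes the current path to @[User::FileName] (SUBSTRING/LEN are used because RIGHT is not available in older SSIS expression versions):

On the constraint leading to the Flat File data flow:
SUBSTRING(LOWER(@[User::FileName]), LEN(@[User::FileName]) - 3, 4) == ".txt"

On the constraint leading to the Excel data flow:
SUBSTRING(LOWER(@[User::FileName]), LEN(@[User::FileName]) - 3, 4) == ".xls"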
TABLE_NAME  DESC    CODE
tab1        table1  A
tab1        table1  B
tab1        table1  C
tab2        table2  D
tab2        table2  E
tab2        table2  G
...
The first-column values are table names which already exist in the target database. The data for the next two columns, [DESC] and [CODE], gets populated from a CSV file into the table.
In this scenario, how do I load the tab1 data into the corresponding table in the destination, and so on?
Which way would be more standard to accomplish this task? If it's a script task using C#, I'm looking for a clear script to identify value changes in the first column.
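A set-based alternative to row-by-row change detection (a sketch; stg.AllRows is a hypothetical staging table holding the whole CSV): land everything once, then fan the rows out by TABLE_NAME, so no script is needed to watch the first column change.

INSERT INTO dbo.tab1 ([DESC], [CODE])
SELECT [DESC], [CODE] FROM stg.AllRows WHERE TABLE_NAME = 'tab1';

INSERT INTO dbo.tab2 ([DESC], [CODE])
SELECT [DESC], [CODE] FROM stg.AllRows WHERE TABLE_NAME = 'tab2';

For many tables, these statement pairs can be generated from the distinct TABLE_NAME values with dynamic SQL.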