Merge Join: Nr Of Output Rows Unchanged When Amount Of Input Changes
May 25, 2007
Dear all,
I created a package that seems to work fine with a small amount of data. When I run the package with more data (as in production), however, the merge join output is limited to 9963 rows, no matter how I change the number of input rows.
Situation as follows.
The package has 2 OLE DB Sources, in which SQL-statements have been defined in order to retrieve the data.
The flow of source 1 is: retrieving source data -> trimming (non-key) columns -> sorting on the key-columns.
The flow of source 2 is: retrieving source data -> deriving 2 new columns -> aggregating the data to the level of source 1 -> sorting on the key columns.
Then both flows are merged and other steps are performed.
If I test with just a couple of rows it works fine. But when I change the where-clause in the data source retrieval so that the number of rows is, for instance, 15,000 or 150,000, the number of rows after the merge join is still 9963.
When I run the package in debug mode the step is colored green; nevertheless, an error is displayed:
Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Merge Join" (4703) failed with error code 0xC0047020. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
To be honest, a few more error messages appear, but they don't seem related to this issue. The package stops running after some 6000 rows have been written to the destination.
I have a problem with a Merge Join providing no output (when it should have 1890 rows). My Data Flow Task has 4 OLE Data Sources, 3 Multicasts, and 1 OLE Data Destination. I am experiencing the problem near the end of my data flow where two Multicasts create two parallel flows of data (see Level 1 below). I have two Merge Joins which join one leg from each multicast with a leg from the other multicast (see Level 2 below). Then the two remaining legs use a Merge to get my destination output (see Level 3 below).
I am experiencing my problem with the Merge Join (input A2, B2) --> (output C2) transformation. The Merge Join providing output C1 appropriately outputs 1890 rows, but C2 outputs 0 rows. Both Merge Joins are identical. The data is identically sorted prior to entering the problematic Merge Join, and a DataViewer (Grid) verified that the data is entering appropriately. Merge Join (input A2, B2) --> (output C2) has 667 rows as input A2 and 1890 rows as input B2 (using an inner join, just like the other merge join), but C2 baffles me with 0 rows of output (when it too should have 1890). I receive no output errors and the execution completes showing all green.
I read about mysterious behavior with Merge Joins and have attempted modifying my EngineThreads property to values between 2 and 10, with no luck. Any help/ideas would be appreciated.
I am using SSIS in SQL Server Enterprise 2005. I have two OLE DB data sources from two disparate databases (IBM DB2 and Microsoft SQL Server), some columns from each of which are to be included in the merged output results. I have noted the various requirements in the forum postings with regard to sorting the OLE DB sources and specifying the output source columns as being sorted, as well as the requirement that the join fields in the two sources be close/exact matches. Yet, when I run this in VS, while the work area reflects the expected number of rows being input into the Merge Join transformation, no count is reflected as output from that transformation into the final destination table. Specifically, my two data sources (IBM DB2 and MS SQL) are configured as follows:
IBM DB2 contains an SQL statement that uses Cast operations to create the result columns and an ORDER BY clause to ensure that the output is sorted by the desired two columns. The OLE DB source property IsSorted is set to true; the Output Columns folder column definitions for "key_source_dtsy" and "key_source_dtrt" have their SortKeyPosition properties set to 1 and 2, respectively. Those fields are both defined as data type DT_STR, with lengths of 4 and 2, respectively. Below is the path metadata from the Data Flow Path editor for the path from this source:
IBM DB2 source path metadata:

Name             Data Type  Precision  Scale  Length  Code Page  Sort Key Position  Comparison Flags  Source Component
ID_CODE          DT_STR     0          0      10      1252       0                                     Source F0005 User Defined Codes
CODE_DESCR_1     DT_STR     0          0      30      1252       0                                     Source F0005 User Defined Codes
CODE_DESCR_2     DT_STR     0          0      30      1252       0                                     Source F0005 User Defined Codes
key_source_dtsy  DT_STR     0          0      4       1252       1                                     Source F0005 User Defined Codes
key_source_dtrt  DT_STR     0          0      2       1252       2                                     Source F0005 User Defined Codes
MS SQL contains an SQL statement that takes the columns as they are in the MS SQL table (no Cast operations needed); it also uses an ORDER BY clause to ensure the output is sorted by the join columns. The OLE DB source property IsSorted is set to true; the Output Columns folder columns for "key_source_dtsy" and "key_source_dtrt" have their SortKeyPosition properties set to 1 and 2, respectively. Those fields are both defined as data type DT_STR, with lengths of 4 and 2, respectively. Below is the path metadata from the Data Flow Path editor for the path from this source:
MS SQL source path metadata:

Name             Data Type  Precision  Scale  Length  Code Page  Sort Key Position  Comparison Flags  Source Component
id_code_name     DT_I2      0          0      0       0          0                                     Source CodeName in db dwVdFY
key_source_dtsy  DT_STR     0          0      4       1252       1                                     Source CodeName in db dwVdFY
key_source_dtrt  DT_STR     0          0      2       1252       2                                     Source CodeName in db dwVdFY
The Merge Join transformation specifies an INNER JOIN using the columns named "key_source_dtsy" and "key_source_dtrt" from the respective data sources. I know there are alternative ways of accomplishing my intent (Lookup, porting the MS SQL table to IBM DB2 so the join can occur in the SELECT statement, etc.); however, I'd like to use this functionality and assume that it should work.
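For what it's worth, one thing worth double-checking when the two sources live in differently-collated systems is that the join key values really are byte-for-byte identical on both sides. A quick check along these lines (the table name below is only a placeholder for the MS SQL source) can reveal trailing spaces or length mismatches that stop an inner Merge Join from matching anything:

-- Placeholder check: find join keys that carry trailing spaces,
-- which would make them differ from the DB2 side's CAST output.
SELECT key_source_dtsy,
       key_source_dtrt,
       DATALENGTH(key_source_dtsy) AS dtsy_bytes,   -- expect 4
       DATALENGTH(key_source_dtrt) AS dtrt_bytes    -- expect 2
FROM dbo.CodeName                                    -- hypothetical MS SQL source table
WHERE DATALENGTH(key_source_dtsy) <> DATALENGTH(RTRIM(key_source_dtsy))
   OR DATALENGTH(key_source_dtrt) <> DATALENGTH(RTRIM(key_source_dtrt));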
We are running merge replication, SQL Server 2005 Enterprise with SQL Mobile 2005 (Windows Mobile 5) subscribers. Partitions are filtered on HOST_ID.
Occasionally we experience a situation where a subscriber has an unusually long synchronization, and upon examining Replication Monitor, it appears that a multiple of the number of rows that should have been inserted (twice, and up to 7 times, the expected number) is recorded as synchronized for the session: once the normal amount as inserts and once the normal amount as updates. This occurs for all tables in the subscription, and it occurs on a first-time synchronization to an empty subscriber database where only inserts should be taking place.
I have examined the snapshot partition folders for these users on the file system and they appear to be identical in size and content to those of other subscribers. Checking the last partition snapshot job run and other characteristics for the subscriber in question, everything appears to be the same as for other subscribers functioning normally.
The HOST_ID for us is an employee ID used to filter employee-specific data. I have seen this happen if the subscriber changes the value of the HOST_ID used in filtering after the mobile database has already been populated (2 employees attempt to use the same mobile device and database). But we have seen this happen recently where the HOST_ID was apparently never changed midstream.
This just started happening recently. The only modification around the same time frame was the implementation of a custom business logic handler/custom conflict resolver that behaves like "Latest Wins" but has logic added to update a last-update datetime column for selected transaction tables at the time of synchronization, so that an SSIS job can detect the changed rows for copying incremental database changes to another application database. This all seems to be working perfectly.
I'm trying to use the Merge component. When I attach a data source to the component, the Select Input/Output dialog box should pop up. It does, but VS.NET hangs and I can only shut down the process...
Any idea how I should solve this? How can I re-register this component?
Hi, I have a strange issue here. Both input datasets of the merge join are sorted. Everything works fine if the dataset doesn't contain special characters like - (hyphen). This is the sorted dataset: ASI311_3, ASI311_3, ASI312, ASI311, ASIKit1, ASIKit1.
The Merge Join has to look for ASI311, finds it in the list, and everything works as expected. But if the dataset is changed to
ASI311_3, ASI311_3, ASI-312, ASI311, ASIKit1, ASIKit1
then it doesn't find ASI311 in the above dataset. The only change between the two datasets is the - (hyphen) added to one of the fields. Does SSIS have a different precedence for special characters than the database? Both datasets are on SQL Server.
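For context, this is most likely a sort-order mismatch rather than anything specific to the hyphen itself: SQL Server's Windows collations use a "word sort" in which the hyphen is given almost no weight, while the Merge Join expects its inputs ordered by the comparison rules SSIS itself uses. A small repro sketch (the temp table and values are just for illustration) shows how the same list orders differently under a word-sort collation and a binary collation:

-- Hypothetical repro: same values, two different orderings.
CREATE TABLE #codes (code varchar(20));
INSERT INTO #codes (code) VALUES ('ASI311');
INSERT INTO #codes (code) VALUES ('ASI311_3');
INSERT INTO #codes (code) VALUES ('ASI-312');
INSERT INTO #codes (code) VALUES ('ASIKit1');

-- Word sort: the hyphen is largely ignored when ordering.
SELECT code FROM #codes ORDER BY code COLLATE Latin1_General_CI_AS;

-- Binary sort: the hyphen is compared by its code point, so the order changes.
SELECT code FROM #codes ORDER BY code COLLATE Latin1_General_BIN2;

DROP TABLE #codes;

If the ORDER BY in the source query and the ordering the Merge Join assumes disagree about where hyphenated values fall, a key such as ASI311 can be streamed past before the join ever compares it.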
I've run into something that looks like a bug to me but I wanted to run it by the board:
Merge join 2 sorted tables.
Table1: ColumnA : Sort Order 1, ColumnB Sort Order 2
Table2 : ColumnA: Sort Order 1, ColumnB Sort Order 2, ColumnC not sorted
Merge Join the two tables on ColumnA and ColumnB...
Choose the following as output columns
A + B + C = works
C = works
A + C = works
B + C = NOT work.. error message: The column with the SortKeyPosition value of 0 is not valid. It should be 2.
Basically, if you choose one or more of the sorted columns in the output, at least one of them has to be the column with sort position 1, or you'll get that error.
Is this a bug or intentional? If you do not have sort column 1 in the output, that output could no longer be considered sorted... so perhaps the error is related to that (although instead of an error I'd expect a warning about the sorting). Interesting that it lets you choose C only, because that also makes the output unsorted.
I have a sql statement that joins two tables and I get back a few thousand records when I run it in query tool in management studio.
But when I use SSIS merge join to join the two tables my output is 0 records.
I did sort the key column in both tables by setting the 'SortKeyPosition' property to 1 in the advanced editor for the output of both tables.
However, the merge join returns nothing to my destination tables. I am also doing an inner join. The task runs without error but returns nothing as well... any ideas?
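One common cause: setting SortKeyPosition in the advanced editor only declares that the data is sorted; it does not sort it. The source query itself still needs an ORDER BY on the same columns, in the same order. A sketch (table and column names are placeholders):

-- Each OLE DB source should physically sort on the join key;
-- the IsSorted/SortKeyPosition metadata merely describes that fact.
SELECT KeyCol, OtherCol1, OtherCol2
FROM dbo.TableA
ORDER BY KeyCol;

SELECT KeyCol, OtherCol3
FROM dbo.TableB
ORDER BY KeyCol;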
My dimension table has a column called AccountNumber varchar(20), which is the business key.
I use SCD task to manage TYPE 2 changes on a particular column.
However, the SCD component doesn't seem to recognize the change on the column. The datatype of the column in the dimension table and the datatype of the input column are the same.
The datatype of the business key, varchar(20), is also the same as the input column business key.
However, when I ran SQL profiler to see what is going on, I found that the SELECT statement
that was issued with the business key in the WHERE clause has char(20) instead of varchar(20)
for the business key. So, for example if the account number in the dimension table is '1234',
the SELECT statement is checking for '1234 '.
I trimmed the incoming string but I still get the problem. Any help would be appreciated.
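To see the padding the profiler trace is showing, here is a small illustration of how a char(20) parameter behaves compared with varchar(20) (the values are just examples):

DECLARE @bk_char    char(20);
DECLARE @bk_varchar varchar(20);
SET @bk_char    = '1234';   -- stored as '1234' followed by 16 trailing spaces
SET @bk_varchar = '1234';

SELECT DATALENGTH(@bk_char)    AS char_bytes,     -- 20
       DATALENGTH(@bk_varchar) AS varchar_bytes;  -- 4

Note that the = comparison itself ignores trailing spaces under ANSI padding rules, so the padding alone may not be what breaks the match, but it does confirm the parameter is being bound as char(20). Explicitly casting the lookup/SCD parameter to varchar(20), or RTRIM-ing on the dimension side, is the usual workaround people try.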
I'm doing a data conversion on one of my fields (SUMDWK) from one of the tables that will be used in a merge join. With the new, converted field, I do a lookup. From this lookup, I want to take a new field, FiscalWeekOfYear, and replace the original field, SUMDWK. This is necessary because SUMDWK is one of the sorted fields. In the lookup, it is not possible to change the Output Alias. Does anybody know a way around this? Thanks.
In the first image, as can be seen, I have 2 different data sources which are then joined using a Merge Join (inner join). The "Sort" is on the BusinessEntityID column of the Person table and "Sort1" is on the PersonID column of the Customer table. The merge join of these two results in 19,119 rows.
On the other hand, in the second image I use a single data source with a query that inner joins the tables used in the first image (i.e. the 2 tables that were in 2 different data sources). Also, since the merge cannot operate without a sort key, I have defined TerritoryID as the sort key in the advanced editor. The number of rows I get after this is 10,274. My select query was:
SELECT P.BusinessEntityID, P.PersonType, P.Title, P.FirstName, P.MiddleName, P.LastName, P.Suffix, C.TerritoryID FROM stg.Person AS P INNER JOIN stg.Customer AS C ON C.CustomerID = P.BusinessEntityID ORDER BY C.TerritoryID;
It seems to me the results should have been the same, since in the first case I am using a Merge Join (inner join) and in the second case I am using a SELECT query with an inner join. Upon drilling down I found that in the first case my sort keys are BusinessEntityID and PersonID; if I modify them to CustomerID and BusinessEntityID, which is my join condition (in the inner join query shown above), I get the desired output. What I was wondering is: how does the sort order change the join condition?
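That behaviour matches how the Merge Join works: it walks both sorted streams in lock-step, so the sort keys and the join keys must be the same columns. In the first layout, the equivalent source queries would need to be sorted on the join columns themselves, something like (a sketch against the same staging tables):

-- Left input, sorted on the column used in the join condition.
SELECT BusinessEntityID, PersonType, Title, FirstName, MiddleName, LastName, Suffix
FROM stg.Person
ORDER BY BusinessEntityID;

-- Right input, sorted on its join column.
SELECT CustomerID, TerritoryID
FROM stg.Customer
ORDER BY CustomerID;

Sorting on TerritoryID instead only guarantees the streams are ordered by a column the join never compares, so rows that should match can stream past each other.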
How do I summarise the data in this table to a single output row per site (there are 2 records for every SiteID)? I am based in the UK, so the data copied from SQL is in the US format. I can convert this to UK dates without any issues.
CREATE TABLE [dbo].[MRMReadings]( [SiteIncomeID] [int] IDENTITY(1,1) NOT NULL, [SiteID] [nchar](5) NOT NULL,
[Code] ....
Is it possible to return the data in the following format, merging 2 lines of data into one output:
SiteID  ReadStartDate     ReadEndDate       ReadStartIncome  ReadEndIncome
L0020   19/05/2015 05:00  20/05/2015 05:00  85.98            145.98
L0101   19/05/2015 22:07  20/05/2015 22:14  1,936.08         1,438.89
L0102   20/05/2015 21:16  19/05/2015 21:48  143.65           243.5
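One way to collapse the two readings per site into a single row is to number the readings per SiteID and pivot with conditional aggregation. The ReadDate and Income column names below are assumptions, since the CREATE TABLE above is truncated, and the sketch assumes the earlier reading is the "start":

WITH r AS (
    SELECT SiteID, ReadDate, Income,
           ROW_NUMBER() OVER (PARTITION BY SiteID ORDER BY ReadDate) AS rn
    FROM dbo.MRMReadings
)
SELECT SiteID,
       MAX(CASE WHEN rn = 1 THEN ReadDate END) AS ReadStartDate,
       MAX(CASE WHEN rn = 2 THEN ReadDate END) AS ReadEndDate,
       MAX(CASE WHEN rn = 1 THEN Income   END) AS ReadStartIncome,
       MAX(CASE WHEN rn = 2 THEN Income   END) AS ReadEndIncome
FROM r
GROUP BY SiteID;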
I need to transform a Foxpro table to a SQL Server table, merging all rows where every column value is the same except one into a single row. For that one column with differing values, I want the values merged into a comma- or space-delimited string. The question is whether SSIS is a good candidate for this kind of data munging; I would also be interested in knowing as many ways as possible of doing it. Surely I could produce a Foxpro script in 5 minutes that does this as a pre-processing step before SSIS starts.
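SSIS can certainly do this (for example with a Script Component doing the grouping), but once the data is landed in SQL Server a set-based query is usually the least work. A hedged sketch, assuming hypothetical column names and SQL Server 2017+ for STRING_AGG (older versions would need the FOR XML PATH technique instead):

SELECT KeyCol1,
       KeyCol2,
       STRING_AGG(VaryingCol, ', ') AS MergedValues   -- comma-delimited list of the differing column
FROM dbo.StagedFoxproData                              -- hypothetical staging table loaded from Foxpro
GROUP BY KeyCol1, KeyCol2;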
Hi, guys. A very simple question for all of you: how can I get the amount of bytes exchanged during a Merge replication between two Microsoft SQL 2000 servers? Thank you. Bye, Angelo.
I have a routine that generates an HTML email and sends it just fine, but one of the columns ends up with 4 decimal places for a column datatype of money. How can I get the script to output only 2 decimal places for the amount column from the select statement?
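A common fix is to cast (or, on SQL Server 2012+, FORMAT) the money column in the SELECT that feeds the HTML body; the column and table names below are placeholders:

SELECT CAST(AmountCol AS decimal(19, 2)) AS Amount   -- rounds money's 4 decimals down to 2
FROM dbo.SomeTable;

-- or, if string formatting with thousands separators is wanted (SQL Server 2012+):
SELECT FORMAT(AmountCol, 'N2') AS Amount
FROM dbo.SomeTable;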
Please help me select these rows from these tables. My tables are:

table1
table1Id  groupId  table2id  price
1         1        1         10
2         1        3         1000
3         1        4         500
4         2        1         5
5         2        3         1000
6         2        2         2000

table2
table2id  name
1         hello
2         test
3         test1
4         test2

OK, I want to select the maximum priced row from table1, grouped by groupId, along with table2id and table2.name. My output should be:

groupid  price  table2id  table2.name
1        1000   3         test1
2        2000   2         test

If I do:

select groupid, max(table1.price) as price from table1 group by groupid

this will give me the max priced row from table1. But when I join with table2 it gives me all rows, like:

select groupid, max(price) as price, table2id, table2.name from table1 inner join table2 on table1.table2id = table2.table2id group by groupid, table2id, table2.name

It gives me all rows because I had to group by table2id and table2.name, but I can't take them out because I get a "not an aggregated value" error. I can't figure out any other way, please help.

SQL Server 2000. Thanks, eric
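On SQL Server 2000 the usual pattern is to compute the maximum price per group in a derived table and join back to pick up the remaining columns. A sketch using the tables as described (ties on price would return more than one row per group):

SELECT t1.groupId, t1.price, t1.table2id, t2.name
FROM table1 AS t1
INNER JOIN (
    SELECT groupId, MAX(price) AS maxprice
    FROM table1
    GROUP BY groupId
) AS m
    ON m.groupId = t1.groupId
   AND m.maxprice = t1.price
INNER JOIN table2 AS t2
    ON t2.table2id = t1.table2id;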
OLEDB source 1 SELECT ... ,[MANUAL DCD ID] <-- this column set to sort order = 1 ... FROM [dbo].[XLSDCI] ORDER BY [MANUAL DCD ID] ASC
OLEDB source 2 SELECT ... ,[Bo Tkt Num] <-- this column set to sort order = 1 ... FROM ....[dbo].[FFFenics] ORDER BY [Bo Tkt Num] ASC
These two tasks are followed immediately by a MERGE JOIN
All columns in source1 are ticked, all columns in source2 are ticked, and the join key is shown above. The join type is a left outer join (source 1 -> source 2).
result of source1 (..dcd column) ... 4-400-8000119 4-400-8000120 4-400-8000121 4-400-8000122 <--row not joining 4-400-8000123 4-400-8000124 ...
result of source2 (..tkt num column) ... 4-400-1000118 4-400-1000119 4-400-1000120 4-400-1000121 4-400-1000122 <--row not joining 4-400-1000123 4-400-1000124 4-400-1000125 ...
All other rows are joining as expected. Why is it failing for this one row?
I need to create an output from a T-SQL query that picks a numeric variable and uses the print function to output it with leading zeroes if it is less than three characters long when converted to a string. For example, if the variable is 12 the output should be 012, and if the variable is 3 the output should be 003.
Presently the syntax I am using is PRINT STR(@CLUSTER, 3). But if @CLUSTER, which is numeric, is less than three characters long, I get spaces in front.
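A common workaround is either to pad on the left and take the rightmost three characters, or to replace the spaces that STR produces with zeros; for example (assuming @CLUSTER always fits in three digits):

DECLARE @CLUSTER int;
SET @CLUSTER = 12;

PRINT RIGHT('000' + CAST(@CLUSTER AS varchar(3)), 3);  -- prints 012
PRINT REPLACE(STR(@CLUSTER, 3), ' ', '0');             -- prints 012 (turns STR's leading spaces into zeros)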
I have a merge join (full outer join) task in a data flow. The left input comes from a flat file source and then a script transformation which does some custom grouping. The right input comes from an oledb source. The script transformation output is asynchronous (SynchronousInputID=0). The left input has many more rows (200,000+) than the right input (2,500). I run it from VS 2005 by right-click/execute on the data flow task. The merge join remains yellow and the task never finishes. I do see a row count above the flat file destination that reaches a certain number and seems to get stuck there. When I test with a smaller file on the left it works OK. Any suggestions?
TotalSelected.Value = SqlDataSource1.SelectCommand = "SELECT COUNT(*) FROM tblNews";
The reason I am trying to do this is that if I can find out the number of rows before the SqlDataSource selects for the DetailsView, then I can make the SqlDataSource select depend on the total minus 5. So, e.g., if the total is 200, minus 5, I can select the bottom 195 so it misses the top 5 for the DetailsView. Anyone have any ideas? Thanks, Andy
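If the goal is simply to feed the DetailsView everything except the newest 5 rows, the separate count step can be skipped by excluding the top 5 in the SelectCommand itself. A sketch for SQL Server 2005, assuming tblNews has a key column called NewsID (the column name is an assumption):

SELECT *
FROM tblNews
WHERE NewsID NOT IN (SELECT TOP 5 NewsID FROM tblNews ORDER BY NewsID DESC);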
Hello, I have experienced that some of my tables occupy an extremely large number of pages but contain few rows. An example is a table with 37 rows over 22,000 pages! The columns are simple integer and char. I fixed the problem by introducing a clustered index. Now it only uses 1 page. But can anyone explain this behaviour in SQL Server 2000? Regards, Jakob Mathiasen
Is there a way to do a super-table join, i.e. a two-table join with no matching criteria? I am pulling in a sheet from XL and joining it to a table in SQL Server. The join should read something like "for every row in the sheet I need that row and a code from a table". 100 rows in the sheet merged with 10 codes from the table = 1000 result rows.
This is the simple sql (no join on the tables):
select 1.code, 2.rowdetail from tblcodes 1, tblelements 2
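What's being described is a cross join (every sheet row paired with every code). Purely numeric aliases like 1 and 2 aren't valid identifiers, so the sketch below uses letters instead:

SELECT c.code, e.rowdetail
FROM tblcodes AS c
CROSS JOIN tblelements AS e;   -- 100 rows x 10 codes = 1000 result rows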
I read that merge joins work a lot faster than hash joins. How would you convert a hash join into a merge join? (Referring to output on Execution Plan diagrams.) THANKS
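You can nudge the optimizer with a join hint, though it only pays off if the inputs can be delivered in sorted order cheaply (suitable indexes on the join columns); the table names below are placeholders:

-- Per-join hint:
SELECT a.KeyCol, a.Col1, b.Col2
FROM dbo.TableA AS a
INNER MERGE JOIN dbo.TableB AS b
    ON b.KeyCol = a.KeyCol;

-- Or a query-wide hint:
SELECT a.KeyCol, a.Col1, b.Col2
FROM dbo.TableA AS a
INNER JOIN dbo.TableB AS b
    ON b.KeyCol = a.KeyCol
OPTION (MERGE JOIN);

Without supporting indexes the plan has to add explicit sorts, which can easily make the forced merge join slower than the hash join it replaces, so the optimizer's original choice is often the right one.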
I need to show the total number of rows in specific tables.
The query is as follows:
As part of the planning process to expand the database that supports Northwind operations, the IT manager would like to know how many rows are currently in specific tables so that he can conduct capacity planning.
The results needed include two columns: TableName, containing all the tables in the database, and Rows, which contains the total number of rows per table.
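One way to get a row count for every table without scanning them is to read the partition metadata, which is accurate enough for capacity planning:

SELECT t.name      AS TableName,
       SUM(p.rows) AS Rows
FROM sys.tables AS t
JOIN sys.partitions AS p
    ON p.object_id = t.object_id
   AND p.index_id IN (0, 1)      -- heap or clustered index only, to avoid double counting
GROUP BY t.name
ORDER BY t.name;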
I am trying to get a stored procedure to return the autogenerated numerical primary key of the last row created to our application. I have created what I thought was an output parameter to handle this; however, when the application runs I get a message that seems to indicate that it is ASKING for the parameter instead of returning it. Here is the code of the sproc:
Code Block
USE [chronicle]
GO
/****** Object: StoredProcedure [dbo].[CreateNewLicense] Script Date: 01/03/2008 06:35:52 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[CreateNewLicense]
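Since the procedure body isn't shown, here is a minimal sketch of the usual pattern: declare the parameter as OUTPUT and set it from SCOPE_IDENTITY() after the INSERT (the parameter, table, and column names below are assumptions):

ALTER PROCEDURE [dbo].[CreateNewLicense]
    @LicenseName  varchar(100),          -- hypothetical input column
    @NewLicenseID int OUTPUT             -- returns the generated identity value
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Licenses (LicenseName)   -- hypothetical table
    VALUES (@LicenseName);

    SET @NewLicenseID = SCOPE_IDENTITY();
END

If the application reports that it is being asked for the parameter, it usually means the OUTPUT parameter was never added to the command's parameter collection (or was added without its direction set to Output), so the data layer treats it as a missing input.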
I have an issue where I have multiple rows of data and I need to reduce a dollar amount by a fixed maximum. I am going to throw some code in here to give a rudimentary idea of the data and what the final result should be.
I need to run an update so that the result of the following query:
select LineNum, Code, Amt, MaxAmt from @tbl
looks like this:
LineNum  Code  Amt    MaxAmt
-------  ----  -----  ------
1        AA    10.00  50.00
2        AA    20.00  50.00
3        AA    20.00  50.00
(3 row(s) affected)
I have tried cursors but got unexpected results or the MaxAmt always defaulted to the original even if I updated it. This seems like a simple problem but I have been banging my head against the wall for 2 days now. I've written some pretty complicated updates with less effort than this and I must have some mental block that is keeping me from figuring this out.
I have a problem. I want to know if the time needed to create an index increases proportionally to the number of rows. For example: if creating an index on a table with 10,000 rows takes 15 seconds, does creating an index on a table with 20,000 rows take 30 seconds, 40,000 rows 60 seconds, and so on? Or does it take longer, like 10,000 rows 15 seconds, 20,000 rows 40 seconds, 40,000 rows 80 seconds?
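Roughly speaking, the bulk of an index build is a sort, which scales as n * log n rather than linearly, so the time grows slightly faster than the row count (assuming memory and I/O behave the same at both sizes). A back-of-the-envelope estimate from the example:

15 s * (20,000 * log 20,000) / (10,000 * log 10,000) ~= 15 s * 2.15 ~= 32 s for 20,000 rows
15 s * (40,000 * log 40,000) / (10,000 * log 10,000) ~= 15 s * 4.60 ~= 69 s for 40,000 rows

So the growth is closer to proportional than to the 15/40/80 pattern, but not exactly linear either; in practice, running out of sort memory or spilling to tempdb can make large builds noticeably slower than this estimate.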
I need to generate a csv file from another csv file. It seems simple, but here's the tricky part:
The output needs to have a maximum of 1000 lines; if I reach this, I need to create another csv and fill that new one.
For example:
I have a csv file called fileA and this has 2000 lines and another csv called fileB with 1500 lines.
I need to loop through a folder and get fileA, create an output called FileAOutput, and start filling it; if I reach 1000 lines, I need to create a FileAOutput_2 and fill it with the other 1000 lines. Then I'll go to fileB and do the same thing, but in that case I'll have 500 lines in the second output.