Is It Possible To Lookup Value Based On Two Tables Using Lookup Task
Jun 27, 2007
Hi All,
Actually this is in regard to SCD Type 2 Dimension, Scenario is like that I am moving Fact table from some old source and I have dimensionA description value in fact which I want to replace with appropriate id from Dimension Table and that Dimension table is SCD Type 2 based on StartDate and EndDate and Fact Table doesn't contains direct date value rather there is timeId in Fact so to update the value in Fact table I have to Join Time Dimension table and other Dimension Table to replace fact Description with proper Id.
I try to convert a Procedure that join 8 tables with INNER AND OUTER JOIN, my understanding is that the Lookup task is the one to use and I should break these joins into smaller block, it takes a long time to load when I do this, since each of these tables had 10-40mill. rows and I have 8 tables to go thru, currently this Stored Procedure took 3-4min to run, by converting this to 8 Lookup tasks, it ran for 20min. has anyone run into this issue before and know a work around for this.
We did some "at scale" fuzzy lookup tests today and were rather disappointed with the performance. I'm wanting to know your experience so I can set my performance expectations appropriately.
We were doing a fuzzy lookup against a lookup table with 25 million rows. Each row has 11 columns used in the fuzzy lookup, each between 10-100 chars. We set CopyReferenceTable=0 and MatchIndexOptions=GenerateAndPersistNewIndex and WarmCaches=true. It took about 60 minutes to build that index table, during which, dtexec got up to 4.5GB memory usage. (Is there a way to tell what % of the index table got cached in memory? Memory kept rising as each "Finished building X% of fuzzy index" progress event scrolled by all the way up to 100% progress when it peaked at 4.5GB.) The MaxMemoryUsage setting we left blank so it would use as much as possible on this 64-bit box with 16GB of memory (but only about 4GB was available for SSIS).
After it got done building the index table, it started flowing data through the pipeline. We saw the first buffer of ~9,000 rows get passed from the source to the fuzzy lookup transform. Six hours later it had not finished doing the fuzzy lookup on that first buffer!!! Running profiler showed us it was firing off lots of singelton SQL queries doing lookups as expected. So it was making progress, just very, very slowly.
We had set MinSimilarity=0.45 and Exhaustive=False. Those seemed to be reasonable settings for smaller datasets.
Does that performance seem inline with expectations? Any thoughts to improve performance?
I'm working with an existing package that uses the fuzzy lookup transform. The package is currently working; however, I need to add some columns to the lookup columns from the reference table that is being used.
It seems that I am hitting a memory threshold of some sort, as when I add 3 or 4 columns, the package works, but when I add 5 columns, the fuzzy lookup transform fails pre-execute:
Pre-Execute Taking a snapshot of the reference table Taking a snapshot of the reference table Building Fuzzy Match Index component "Fuzzy Lookup Existing Member" (8351) failed the pre-execute phase and returned error code 0x8007007A.
These errors occur regardless of what columns I am attempting to add to the lookup list.
I have tried setting the MaxMemoryUsage custom property of the transform to 0, and to explicit values that should be much more than enough to hold the fuzzy match index (the reference table is only about 3000 rows, and the entire table is stored in less than 2MB of disk space.
Say I want to lookup a value in another dataset, but there is a grouping that requires you to know what the values for each level is in order to get to the correct detail record.  Can you still use the lookup function with more than one field to compare against? So for example
Department \___SalesPerson     \___Measure
I want to be able to add a new row at the Measure level, but lookup each field from another dataset. In order to do that I will need the Department AND SalesPerson values to do the lookup, but I dont think the Lookup function will let us do that will.
I am doing a lookup that requires mapping 2 columns in the column mapping section. When I do this, I get the error "Row yielded no match during lookup" . The SQL that I captured in SQL profiler does find the record when I run it in Management Studio. I have already tried trimming everything to no avail.
Why is this happening?
I tried enabling memory restrictions but then I my package hangs and I get a SQLDUMPER_ERRORLOG.log file with the following logged:
My lookup data is in a csv file, not a table. Is there a way to get the Lookup transformation to use the csv file as the source 'table'? Obviously the alternative is to load the file into a SQL Server table and use that, but I want to keep it simple if possible.
I've looked at so many examples on the web but I can't get my dataflow with lookup to work correctly. It just keeps inserting the records instead of updating them because they already exist. I know it's something simple but I've been trying to figure it out for the last 4 hours and am over my head. In the lookup I have 5 columns as available input and of course the matching lookup columns and selected 4 with lookup operation replace and tried add as new. It just keeps adding them to my destination table when they already exists. I can't figure out what i am doing wrong?
Is there a way to set the Lookup task to be not case sensitive? For example, the lookup table has id 1, value 'ABCD'. The value that I'm using to lookup is 'abcd'. I cannot make it to return the id 1 unless I convert my lookup value to be all upper case.
i'd like to use ssis on a certain project but am concerned that one of my transformations needs lookup results to be based on actions taken on previous lookups and that the toolkit doesnt really offer something like that.
so, i have a dataflow whose first component extracts certain kinds of data from an xml document.
each row returned by the latter needs a lookup but the results of that lookup may dictate a certain kind of update. The next row's lookup may need to be influenced by the previous row's update.
So I think I have two challenges, 1) combining a lookup and update, 2) making sure the buffer architecture completes one lookup and update before the next lookup begins.
I have a Conditional Split with 3 outputs. On the first output I have a lookup, when I execute the package I have 56 rows going through the Conditional Split, all rows are then going to the 2nd and 3rd output but the lookup on the first output generates an error "Row yielded no match during lookup".
I don't understand why the lookup is generating an error while there is no row going through it.
I have an SSIS package that unpivots data - each row from the source becomes 20-24 rows in the destination db. In the dataflow it looks up some type-data IDs before inserting into the destination table. The whole process flies through the data at an incredible rate, however I find that the final commit on data insertion takes a very long time, and two of the lookup tasks remain yellow throughout. It appears that there may be a contention issue because the final table has FKs to those lookup tables.
In T-SQL it is possible to do SELECT..... WITH (NOLOCK), but when I modified the SQL statement of the lookup task by adding this clause, it caused an error "Incorrect syntax near the keywork 'with'."
Hi, I'm using the lookup task to perform updates on some records and it seems to be working correctly but isn't making all updates. I have a case where there are 543 rows in my source connection, there are 436 of them that do not match the date column of the 543 in my destination table. It only updated 224 of them when it should have updated all 436? Does anyone know why it would only update this many instead of all? I also had another that only updated 18 out of 19 instead of all 19.
I am trying to use this painful new SSIS process. I basically need to use a lookup task to check to see whether a record exists or not. If not, then I need to insert the record. However, because this is treated as an error situation (which is stupid in itself), I get a problem when the number of records not found reach the MaximumErrorCount, and the rest of the package fails. Is there any other method of doing this type of thing, without simply increasing the MaximumErrorCounty to some ludicrous value. I could do this type of thing very very very easily when using DTS packages using the Data Driven Task, it seems so stupid that I can't perform the same kind of task using SSIS.
Hi, My SQL Server database is case in-sensitive. I have a lookup task to join on varhar column. If the data is only case different such as 'DataModel' and 'Datamodle', lookup task consdiers them as different. Is there a way to set up lookup task as case insentive? I know I can work around it by converted them to upper case and compare them.
I am designing a ssis package,This is intends to mine text data(Data extracted from websites). Term lookup/Term extraction has been used as tools for mining. I have lookup terms defined with me for reference table,but the main problem lie in extracting the nearby text/number/charcters to these lookup terms during mining. For example : I found noun "Email" 200 (frequency score) times in my text,Now I want to extract nearby email address(this is also true for PhoneNumber,Address attributes also).so how can I achieve this with SSIS. If u have some idea/suggestion to carry out this challenge with or without Term Extraction/Term Lookup,plz do write here.
I have a requirement regarding a color combination data. I have a lookup table that holds a colorid, p1, p2, p3, p4 to p8 which will be having colors Red, Green and Amber. P1 to P8 columns holds these three colors based on their combinations.
I have attached the look up table data for reference.I need to pass the color values to p1 to p8 and need to retrieve the color id based on the passed color. If we pass values for all p1 to p8 then it is easy to get the color code, however it will not happen. The passed values may be dynamic. ie we will not have all 8 values all the times. sometimes we will have 2 colors passed, sometimes 5 colors will be passed.
If i pass only two colors say red and red, i need the color id of only the row that has red and red for p1 and p2 alone. i dont want want all the colorid's that has red and red in p1 and p2 and some other colors in p3 to p4.
The exact colorid of the combination must be returned on passing the values to p1 and p2.I am passing Red and Red as values to P1 and P2. In the look up table we can have 10 rows that has red and red i p1 and p2 like
So the result must have only the colorid 1 and not all the colorid's listed above. when I pass 3 red as values for p1, p2, p3 then the result must be 10. Colorid 1, 20, 30, 40, 50, 60 and 70 must not come in the result.I need a function or procedure that will accept the arguments and provide me the result based on the values.
Can anyone explain to me (or show me where to get info) in what scenario should I use Full Cache, Partial, or No_Cache mode in the Lookup task? I got below warning message during execute my package. I understand that my lookup data have duplicate reference keys. I want to change the mode to partial or no cache to see how it works, but don't know what to change for the parameter map.
Warning: 0x802090E4 at Data Flow Task, Lookup [37]: The Lookup transformation encountered duplicate reference key values when caching reference data. The Lookup transformation found duplicate key values when caching metadata in PreExecute. This error occurs in Full Cache mode only. Either remove the duplicate key values, or change the cache mode to PARTIAL or NO_CACHE.
So i am designing a new database that currently has several tables 'look up tables' that are used just to limit the values of columns in other tables.
my question is what is the best way to do this?
1. multiple tables - one for each set of values (ex: JobType, Position, PayGrade) 2. One large table that holds all the lookup values - has a 'Category' field to group them 3. put constraints on the columns of the tables that are 'looking up' and get rid of the lookup tables.
We have a LKP table that we will use to decode incoming codes from multiple sales feeds to link to the relevant surrogate key for the ultimate dimension table.
The LKP table has a start date, end date and active row flag to track history of the codes.
This table in its entirety is required for the FULL history load but we will need a CURRENT version of the table containing only the ACTIVE rows for the DAILY incremental loading.
I am thinking of putting a standard view over the top of the LKP table to present only the ACTIVE rows rather than creating a 2nd table and reloading it each night. We don't have enterprise edition to use materialised views.
My question is, will the LKP cache populate as the SSIS packages begins, loading it with all the rows from the view, as it would if I used the 2nd physical table with only ACTIVE rows?
SSIS data flow transformation - Lookup task - best practice concerning Cachetype
I would like to know if there's any best practice concerning the CacheType property for the the Lookup task. Default value is "Full", but if the SSIS package is working with at lot of data, i.e. +10 mill. records from the OLE DB source to be handled through a variated numbers of data flow transformation tasks, it must have an impact memory usage if the lookup table also is a large table, i.e. +8 mill. records? When should I consider turning the property value to "none"
I Can't reproduce the error if I run the package stand-alone.
I'm using the same lookup call (same table, etc.) in 2 packages that are running in parallel (called by a parent package).
[LKP_UnderwriterId [72283]] Error: An OLE DB error has occurred. Error code: 0x80040E05. An OLE DB record is available. Source: "Microsoft OLE DB Provider for SQL Server" Hresult: 0x80040E05 Description: "Object was open.".
Why is there no Expressions collection for the Fuzzy Lookup task? Is there another good way to dynamically configure the ReferenceTableName and MatchIndexName properties?
I'm creating an application that will allow users to contribute "content". The content can be tagged, saved as "my content", etc... very web 2.0'ish. The site will rely very heavily on SQL Server 2005. By default, there is a Content table that simply stores the content with an identity column PK. There is also a Tag table and User table. What is the most effective schema for speed and reliable scalability? Eventually there could be hundreds to thousands of people contributing and tagging content. Idea 1: Lookup TablesSimply make a new table that holds the TagID and UserID. The table will be very important as it will be queried very regularly to show the users which tags they have stored.Pros: This will effectively allow me to store & query what tags the user has selected. It's simple to setup and the application won't have to work much with the data returned from the query.Cons: What happens when there are thousands of users tagging content? At what point does it become very inefficient to query a table that has a huge number of rows? Idea 2: Comma Delimited ListSimply has a field in the user table that has a comma delimited list of TagID's the user has selectedPros: Keeps table size low, fairly easy to implement.Cons: Application has to perform more work. It has to separate the TagID's by comma, and requery the database to get each tags data, based on TagID. Those are basically the only two methods I've really got experience using. The reason for this post is to see if there are methods that I'm not aware of that are better suited for what I'm trying to do. Any assistance is greatly appreciated, thank you in advance!
Hi,If you have lookup tables which are used by multiple tables (e.g.'City' lookup table might be used by the 'Employee' and the 'Company'tables) do we need to link it to both tables? Or can it sit by itselfand just referenced?Cheers,Jack
Is it possible to parameter the connection of a Lookup Transformation task - specifically the table/view name? I would like to be able to dynamically set the table that the Lookup Transformation is connecting to at runtime.I've looked into the "Use results of an SQL query" on the connection screen (which correlates to the "SqlCommand" property), but I'm unable to pass in a parameter this way.I've also looked into the SqlCommandParam, but that doesn't allow me to use a parameter in the "FROM" clause of the sql syntax.
It's been always said that it is best to put index on commonly joined fields in the table. But putting too much index on the table would cause the table to be slow on insert and update.
My question is, how do you deal with your fields that uses look up tables? Like for example for these fields
Those fields don't come a big part in the table, though when I query the table I always join them with their respective primary table to get their respective text value. Do I still need to put Index & FK relationships to these fields?
What fields are normally good candidates for index or fk relationships?
Two tables:T1 (c1 int, TestVal numeric(18,2), ResultFactor numeric(18,2))--c1 isthe primary key.T2 (x1 int, FromVal numeric(18,2), ToVal numeric(18,2), Factornumeric(18,2))--x1 is the primary key. T2 contains non-overlappingvalues. So for eg., a few rows in T2 may look like.1, 51, 51.999, 512, 52, 52.999, 52........32, 82, 82.999, 82........T2 is basically a lookup table. There is no relationship between thetwo tables T1 and T2. However, if the TestVal from T1 falls in therange between FromVal and ToVal in T2, then I want to updateResultFactor in T1 with the corresponding value of Factor from the T2table.------Example for illustration only---------------Even though tables cannot be joined using keys, the above problem is avery common one in our everyday life. For example T1 could beemployees PayRaise table, c1=EmployeeID, with "TestVal" representingtest scores (from 1 to 100). T2 representing lookup of the ranges,with "Factor" representing percent raise to be given to the employee.If TestVal is 65 (employee scored 65% in a test), and a row in T2(FromVal=60, ToVal=70, Factor=12), then I would like to update 12 intable T1 from T2 using sql;. Basically T2 (like a global table)applies to all the employees, so EmpID cannot serve as a key in T2.---------------------------------------------------------Could anyone suggest how I would solve MY PROBLEM using sql? I wouldlike to avoid cursors and loops.Reply appreciated.Thanks
I need to match each car to it's respective class. I've already search the database for post on this subject, but unfortunatelly my goal is yet to be reached. Can someone help?
The logic I am trying to recreate via SSIS is the following SQL statement:
insert into db3.dbo.targettable1 -- Target database table (SiteC, Objecte, Attrib1)
select distinct ?, ?, from ? -- Source database table join dbo.targettable2 c1 -- Target database table on c1.Alias = ? and c1.CSetID = ? and c1.FacID = (select f.PFacID from dbo.Fac f where f.FacID = ?) where not exists (select * from dbo.targettable2 c -- Target database table where c.Alias = ? and c.FacID = ? and c.CSetID = ?)
I have an OLE DB Source that consists of an expression to approximate the following portion of the Above Select statement:
Select ?, from ? -- Source database table and
The package has 2 global variables User:CSetID and User::FacID whose scope is global to the package and whose values are set within a Foreach Loop Container outside of the Data Flow Task
I was trying to reference the 2 global variables within the Looup Transformation to recreate the following portion of the SQL statement.but encounter errors:
join dbo.targettable2 c1 -- Target database table on c1.Alias = ? and c1.CSetID = ? and c1.FacID = (select f.PFacID from dbo.Fac f where f.FacID = ?)
In the Advanced Editor window of Lookup Transaction
select * from (select * from [dbo].[targettable2 ]) as refTable where [refTable].[Alias] = ? and [refTable].[FacID] = ? and [refTable].[CSetID] = ?
Is there away to reference global variables in a Lookup Transformation that are set outside a Data Task Flow?