Fuzzy Lookup Problems
Jun 16, 2006
Hi everyone,
I've just started looking at the Fuzzy Lookup feature and I think I must be getting something fundamentally wrong. I have two tables, each containing a different metadata representation for a set of potentially similar documents. The only chance I have of matching a document in table A to a document in table B is a common title field. However, manual input means that the titles may differ between the two tables, although they are quite similar in most cases.
In the lookup I get to specify the output columns from table B (the reference table), which is fine, but I don't seem to get to choose the columns from table A that I would also like to see. So my output shows me all the documents from table B that it thinks are similar to ones in table A, but without identifying which record each one is similar to.
I initially thought that the "pass through" columns that I identified would appear in the output, but this does not seem to be the case.
I must be using it incorrectly, but I have no idea how to progress with this apart from creating a new source table (C) which is a full outer join of tables A and B, and then also using table C as the reference table, but that seems like madness.
Any help would be appreciated - ta
Andrew
Aug 14, 2007
Dear Friends,
I think Fuzzy Lookup compares the mapped columns by spelling, and will reject a match if at least one letter is different in any record's mapped column (e.g. RAVI != REVI).
What is Fuzzy Grouping? Please explain.
regards
koti
Oct 31, 2007
We did some "at scale" fuzzy lookup tests today and were rather disappointed with the performance. I'd like to hear about your experience so I can set my performance expectations appropriately.
We were doing a fuzzy lookup against a lookup table with 25 million rows. Each row has 11 columns used in the fuzzy lookup, each between 10-100 chars. We set CopyReferenceTable=0 and MatchIndexOptions=GenerateAndPersistNewIndex and WarmCaches=true. It took about 60 minutes to build that index table, during which, dtexec got up to 4.5GB memory usage. (Is there a way to tell what % of the index table got cached in memory? Memory kept rising as each "Finished building X% of fuzzy index" progress event scrolled by all the way up to 100% progress when it peaked at 4.5GB.) The MaxMemoryUsage setting we left blank so it would use as much as possible on this 64-bit box with 16GB of memory (but only about 4GB was available for SSIS).
After it got done building the index table, it started flowing data through the pipeline. We saw the first buffer of ~9,000 rows get passed from the source to the fuzzy lookup transform. Six hours later it had not finished doing the fuzzy lookup on that first buffer! Running Profiler showed us it was firing off lots of singleton SQL queries doing lookups, as expected. So it was making progress, just very, very slowly.
We had set MinSimilarity=0.45 and Exhaustive=False. Those seemed to be reasonable settings for smaller datasets.
Does that performance seem in line with expectations? Any thoughts on how to improve performance?
Sep 26, 2007
I'm working with an existing package that uses the fuzzy lookup transform. The package is currently working; however, I need to add some columns to the lookup columns from the reference table that is being used.
It seems that I am hitting a memory threshold of some sort, as when I add 3 or 4 columns, the package works, but when I add 5 columns, the fuzzy lookup transform fails pre-execute:
Pre-Execute
Taking a snapshot of the reference table
Taking a snapshot of the reference table
Building Fuzzy Match Index
component "Fuzzy Lookup Existing Member" (8351) failed the pre-execute phase and returned error code 0x8007007A.
These errors occur regardless of what columns I am attempting to add to the lookup list.
I have tried setting the MaxMemoryUsage custom property of the transform to 0, and to explicit values that should be much more than enough to hold the fuzzy match index (the reference table is only about 3000 rows, and the entire table is stored in less than 2MB of disk space).
Any ideas on what else could be causing this?
Feb 16, 2007
Hi,
I am using a fuzzy lookup to cleanse data from a sales line details table during the import process. The sales order line details contain a field called 'reference', and this is compared to a field called 'category' in another table.
Using data viewers to check through the cleansing process, I notice that the fuzzy lookup doesn't always seem to match correctly, e.g.
tbl.salesline.reference = 'I3' -> tbl.sales.category ='I03'
the above is OK, but the lookup also returns the following
tbl.salesline.reference = 'I9' -> tbl.sales.category ='I01'
The value I9 doesn't exist; it was miskeyed during user entry and should have been 'I99'. I would have expected the fuzzy lookup to pick up the I99 value, as at least two of the characters match, but no, it picks the first 'I*' in the table.
If I expand the fuzzy lookup to return more results, e.g. 5 per record, then it returns the first 5 results: I01, I02, I03 and so on.
Is there a way of improving the fuzzy lookup itself?
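For what it's worth, the ranking expected above is roughly what a plain character-bigram comparison would give. The sketch below is only an illustration in Python and is not how Fuzzy Lookup scores rows internally; the helper names (bigrams, bigram_jaccard, rank_candidates) are made up for this example. It may be useful as a sanity check on short codes before tuning the transform.

```python
def bigrams(text):
    """Return the set of 2-character substrings of text (upper-cased)."""
    text = text.upper()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_jaccard(a, b):
    """Jaccard overlap of the two bigram sets, between 0.0 and 1.0."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def rank_candidates(value, candidates):
    """Sort candidates from most to least similar to value."""
    return sorted(candidates, key=lambda c: bigram_jaccard(value, c), reverse=True)


# Category codes taken from the post; 'I9' is the miskeyed input.
categories = ["I01", "I02", "I03", "I99"]
print(rank_candidates("I9", categories)[0])  # prints I99 (shares the bigram "I9")
```

Note that 'I3' and 'I03' share no bigram at all, so this particular measure would not explain the first match in the post; codes this short give any fuzzy measure very little to work with.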
Feb 6, 2008
The Enterprise edition of SQL Server includes some advanced BI features, for example the Fuzzy Lookup feature of IS. If the IS package lives on an Enterprise edition of SQL Server and the database the package is targeting lives on a Standard edition of SQL Server, can the advanced features be used? Can you run a fuzzy lookup against a database on a Standard edition of SQL Server when the IS package lives on an Enterprise edition of SQL Server? Thanks!
Jan 19, 2007
Hi Friends,
Can somebody briefly explain the difference between Fuzzy Lookup and Fuzzy Grouping?
thanks and regards
May 25, 2007
Hi,
Could someone please help!
I'm doing a fuzzy lookup based on 3 fields (Surname/DOB/Gender). The only difference between the two sets of data is the case of the first letter of the Surname.
The reference table has "Stuart" and the lookup input has "stuart". I have set the Fuzzy Lookup input for Surname to Ignore Case, but it still won't match.
The DOB/Gender are exactly the same.
Why does this not work? Is there a workaround?
Many Thanks, Deano
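One possible workaround is to normalise the case on both sides before the rows reach the lookup, for example in a derived column or in the query that feeds the reference table. The Python sketch below only illustrates that idea; the row values and the helper names (normalize_surname, same_person_key) are invented for the example, and it does not explain why the Ignore Case flag did not help here.

```python
def normalize_surname(value):
    """Trim whitespace and fold case so 'Stuart' and ' stuart ' compare equal."""
    return value.strip().casefold()


def same_person_key(row):
    """Build the comparison key from the three join fields in the post."""
    return (normalize_surname(row["surname"]), row["dob"], row["gender"])


# Made-up rows: only the case of the surname differs.
reference = {"surname": "Stuart", "dob": "1970-01-01", "gender": "M"}
incoming = {"surname": "stuart", "dob": "1970-01-01", "gender": "M"}

print(same_person_key(reference) == same_person_key(incoming))  # prints True
```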
May 16, 2006
I am trying to run an SSIS package that contains a fuzzy lookup. I am using a flat file with about 7 million records as the input. The reference table has about 2000 records. The package fails after about 40,000 records with the following information:
------------------------
Warning: 0x8007000E at Data Flow Task, Fuzzy Lookup [228]: Not enough storage is available to complete this operation.
Warning: 0x800470E9 at Data Flow Task, DTS.Pipeline: A call to the ProcessInput method for input 229 on component "Fuzzy Lookup" (228) unexpectedly kept a reference to the buffer it was passed. The refcount on that buffer was 2 before the call, and 1 after the call returned.
Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: The ProcessInput method on component "Fuzzy Lookup" (228) failed with error code 0x8007000E. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.
Error: 0xC0047021 at Data Flow Task, DTS.Pipeline: Thread "WorkThread0" has exited with error code 0x8007000E.
Error: 0xC02020C4 at Data Flow Task, Flat File Source [1]: The attempt to add a row to the Data Flow task buffer failed with error code 0xC0047020.
Error: 0xC0047039 at Data Flow Task, DTS.Pipeline: Thread "WorkThread1" received a shutdown signal and is terminating. The user requested a shutdown, or an error in another thread is causing the pipeline to shutdown.
Error: 0xC0047021 at Data Flow Task, DTS.Pipeline: Thread "WorkThread1" has exited with error code 0xC0047039.
Error: 0xC0047038 at Data Flow Task, DTS.Pipeline: The PrimeOutput method on component "Flat File Source" (1) returned error code 0xC02020C4. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.
Error: 0xC0047021 at Data Flow Task, DTS.Pipeline: Thread "SourceThread0" has exited with error code 0xC0047038.
-------------------------------
I have tried many things: changing the BufferTempStoragePath to a drive that has plenty of space, changing the MaxInsertCommitSize to 5,000...
What else can I do?
Thanks!
Mar 8, 2006
Fuzzy Lookup seems to be causing me some problems. It seems to work at times and not at others. It will run fine a couple of times and give me the desired results, but then, without any change to the data flow or the data, the next few runs fail during pre-execute.
Now I'm currently getting the following error:
[Fuzzy Lookup [248]] Error: An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Login timeout expired". An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections.". An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Named Pipes Provider: Could not open a connection to SQL Server [233]. ".
[DTS.Pipeline] Warning: A call to the ProcessInput method for input 249 on component "Fuzzy Lookup" (248) unexpectedly kept a reference to the buffer it was passed. The refcount on that buffer was 2 before the call, and 1 after the call returned.
[DTS.Pipeline] Error: The ProcessInput method on component "Fuzzy Lookup" (248) failed with error code 0xC0202009. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.
Any help would be appreciated.
Oct 18, 2006
Hi
I get the following error when I use Fuzzy Lookup in a Data Flow task with the TransactionOption property set to "Required":
[Fuzzy Lookup [61]] Error: An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Cannot create new connection because in manual or distributed transaction mode.".
When I change TransactionOption to "Supported" it works fine.
I need the property set to Required so that it rolls back in the event of a failure.
Any ideas on how to get the Fuzzy Lookup to work?
Sep 30, 2007
I have a Fuzzy Lookup in a Data Flow Task that is performing a simple text match based on a data view in SQL Server.
I keep obtaining the error below and I have no idea why. Is there a minimum number of rows required in the view in order for the lookup to work properly?
When I take the Store/Manage Index options off the lookup seems to work properly.
Thank you!
[Fuzzy Merchant Lookup [2832]] Error: SSIS Error Code DTS_E_OLEDBERROR.
An OLE DB error has occurred. Error code: 0x80040E14.
An OLE DB record is available.
Source: "Microsoft SQL Native Client"
Hresult: 0x80040E14
Description: "A .NET Framework error occurred during execution of user-defined routine or aggregate "sp_FuzzyLookupTableMaintenanceInstall": System.Data.SqlClient.SqlException: Error number 8197 is invalid. The number must be from 13000 through 2147483647 and it cannot be 50000.
System.Data.SqlClient.SqlException:
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnectionSmi.EventSink.DispatchMessages(Boolean ignoreNonFatalMessages) at Microsoft.SqlServer.Server.SmiEventSink_Default.DispatchMessages(Boolean ignoreNonFatalMessages)
at System.Data.SqlClient.SqlCommand.RunExecuteNonQuerySmi(Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.RaiseErrorId(SqlCommand cmd, FltmErrorMsgId MsgId, FltmErrorState State, SqlServerSeverity Severity)
at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.ReportErrors(SqlCommand cmd, ExceptionType Type, String ErrorMessage, FltmErrorMsgId MsgId, FltmErrorState State, SqlServerSeverity Severity, SqlErrorCollection errors)
at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.TranWrap(DataCleaningOperation c)
at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.ServerInstall(String etiTableName) .".
Aug 31, 2006
Is it possible to have multiple fuzzy lookups within a data flow?
I need to have at least 3 fuzzy lookups in a data flow. Here are the conditions I'm trying to match on: 1=Zip&City, 2=Zip&State, 3=City&State. I have the first fuzzy lookup working fine. After that, I have a conditional split to pick up any unmatched rows, then use another fuzzy lookup for the second condition. At this point, I get the error saying "The package contains two objects with duplicate name of output column _Similarity..." I do not need the _Similarity and _Confidence columns, so is there a way to exclude them from the output?
Any comments?
Thanks in advance.
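The three-stage cascade described above can be sketched outside SSIS like this. It is a rough Python illustration only: exact comparison stands in for the fuzzy match, the sample rows are invented, and STAGES and cascade_match are hypothetical names. Each result is tagged with the stage that produced it, which is the same idea as keeping each stage's score columns distinguishable; whether the Fuzzy Lookup editor allows renaming or dropping _Similarity and _Confidence is not something this sketch answers.

```python
# Each stage names the pair of fields it matches on, in the order from the post.
STAGES = [("zip", "city"), ("zip", "state"), ("city", "state")]


def cascade_match(row, reference_rows):
    """Try each stage in order; return (stage_number, reference_row) or (None, None)."""
    for stage_number, fields in enumerate(STAGES, start=1):
        for ref in reference_rows:
            # Exact equality is a placeholder for a fuzzy comparison.
            if all(row[f] == ref[f] for f in fields):
                return stage_number, ref
    return None, None


# Made-up data: the city is spelled differently, so stage 1 fails and stage 2 matches.
reference_rows = [{"zip": "10001", "city": "New York", "state": "NY"}]
row = {"zip": "10001", "city": "NYC", "state": "NY"}

stage, matched = cascade_match(row, reference_rows)
print(stage)  # prints 2
```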
Nov 15, 2007
Hi all
I've been doing some research and running some PoCs on using the Fuzzy Lookup Transformation (FLT) and had two questions:
1) When you choose to have a maximum of 1 output returned for each input, does FLT pick this output based on the best (highest) similarity and confidence scores or the first one it finds?
2) Why does FLT not support dynamically setting properties such as ReferenceTableName or MatchIndexName?
Any help or guidance with this is greatly appreciated.
Apr 3, 2008
I created a fuzzy transformation with an input table and a reference table. When I go to the Columns tab, there are no available input or lookup columns displayed. But if I select a different reference table, sometimes it works.
Are there any specific properties a reference table must have in order for columns to show up?
Thanks,
Tom
Mar 5, 2007
What is the difference between the Fuzzy Lookup transformation and the Lookup transformation in SSIS? Is there a real-world scenario that would help me understand it better?
Jul 31, 2006
I am trying to run the Fuzzy Lookup on a SQL2K reference table using a 2005 SSIS package and keep getting the following error:
[Fuzzy Lookup [2601]] Error: An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Cannot create a row of size 8061 which is greater than the allowable maximum of 8060.".
Regardless of the changes I make I cannot get this to work and it would make a huge difference if I could get it to run.
Can I create the FuzzyLookupIndex on a SQL2K database?
Any help or advice would be greatly appreciated.
Many thanks
C.
Apr 13, 2007
Sorry, this might be an obvious question, but I can not find anything in the documentation/forum.
I want to use a Fuzzy Lookup between 2 Oracle tables.
I select the Reference Table.
I then switch to the Columns tab, but the "Available Input Columns" and "Available Lookup Columns" lists are always empty.
I have experimented quite a bit, but to no avail. I noticed this on the Reference Table tabpage : "The table maintenance feature requires the installation of a trigger on the reference table". My guess would be that SSIS does not support Oracle for this, but I am not able to find anything in the documentation that it doesn't.
Any answer/pointer greatly appreciated.
Thanks
Jan Vandepitte
Mar 26, 2008
I have come across something with Fuzzy Lookup and don't know whether I am doing something wrong or whether this is the behaviour we should expect from Fuzzy Lookup.
I have a Test table as shown below with a couple of sample rows.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Test]') AND type in (N'U'))
DROP TABLE [dbo].[Test]
GO
CREATE TABLE [dbo].[Test](
[Code] [varchar](4) NOT NULL,
[Name] [varchar](50) NULL,
[Server] [varchar](50) NULL
) ON [PRIMARY]
GO
INSERT INTO [Test] ([Code],[Name],[Server])VALUES('PQR','CONTROL GEAR (GROUP) LTD','ELPS122')
GO
INSERT INTO [Test] ([Code],[Name],[Server])VALUES('PQR','CONTROL GEAR (GROUP)','ELPS122')
GO
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[vwTest]'))
DROP VIEW [dbo].[vwTest]
GO
CREATE VIEW [dbo].[vwTest]
AS
SELECT Code, [Name]
FROM Test
GO
OLE DB Data Source - I read the data from Test Table.
Fuzzy Lookup - vwTest is used as Reference Table Name. Joined by Code & Name. Maximum No of matches to output per lookup is set to 5.
Row Count - Data Viewer between Fuzzy Lookup and RowCount
The results as shown below:
Name                       Name (1)                   _Similarity_Name
CONTROL GEAR (GROUP) LTD   CONTROL GEAR (GROUP) LTD   1
CONTROL GEAR (GROUP) LTD   CONTROL GEAR (GROUP)       0.6
CONTROL GEAR (GROUP)       CONTROL GEAR (GROUP)       1
CONTROL GEAR (GROUP)       CONTROL GEAR (GROUP) LTD   0.8
My question is: should we expect to get the same similarity value in both directions? It doesn't produce the same similarity value in my testing.
I was expecting the same similarity score for the following two comparisons:
Is "CONTROL GEAR (GROUP) LTD" the same as "CONTROL GEAR (GROUP)"?
Is "CONTROL GEAR (GROUP)" the same as "CONTROL GEAR (GROUP) LTD"?
I think I know the answer, but I would like to know why.
Thanks
Sutha
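To see how a similarity score can depend on which string is the input and which is the reference, here is a toy measure in Python. It is an assumption-laden illustration, not Fuzzy Lookup's real scoring, and it does not reproduce the 0.6 and 0.8 values above: a token that is extra in the input costs 1.0, a token missing from the input costs a hypothetical insert_cost of 0.5, and the total is normalised by the number of input tokens.

```python
def toy_similarity(input_text, reference_text, insert_cost=0.5):
    """Toy asymmetric score: 1 - (edit cost / number of input tokens)."""
    inp = input_text.upper().split()
    ref = reference_text.upper().split()
    if not inp:
        return 0.0
    extra = [t for t in inp if t not in ref]      # tokens to delete from the input
    missing = [t for t in ref if t not in inp]    # tokens to insert into the input
    cost = len(extra) * 1.0 + len(missing) * insert_cost
    return max(0.0, 1.0 - cost / len(inp))


a = "CONTROL GEAR (GROUP) LTD"
b = "CONTROL GEAR (GROUP)"

print(toy_similarity(a, b))            # prints 0.75 (input has an extra token)
print(round(toy_similarity(b, a), 2))  # prints 0.83 (input is missing a token)
```

Both the different per-token costs and the normalisation by the input length make the two directions disagree, and the direction of the gap (lower score when the longer string is the input) is the same as in the results above.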
Nov 22, 2007
Hi,
I am using Fuzzy Lookup in my transformation. I wanted to know if there is a way to use variables for the MinSimilarity property in the Advanced Editor tab. Instead of giving a hardcoded value between 0 and 1, I want to take the value from a variable and use it. Is this possible in SSIS?
Thanks,
Akalya
May 29, 2007
Hello,
I have a peculiar problem in my project. My project design is like this.
The numbers in (...) are record counts.
File feed (1000)
       |
       |
Fuzzy Lookup
against Table2
       |
       |
Split Fz Lookup results
(_Similarity >= 0.60 && _Confidence >= 0.85)
       |                      |
       |                      |
       |         Write matches to Table1 (250)
       |
Fuzzy Group
Remaining rows (750)
       |
       |
Split Fz Group results
       |                      |
       |                      |
Write Canonicals         Write Dupes
to Table2                to Table1
(300)                    (450)
This is basically a customer de-duplication project. Table2 holds the canonicals and Table1 holds the dupes (of the canonicals). I already have some data in these tables, and the new data is matched against the existing data in these tables and classified as new customers and duplicate customers.
In the above process, the rows identified by the Fuzzy Lookup task as dupes of already existing canonicals are written to the dupes table (Table1) and should not be processed further down the line in the project. But in my case I see that the matches identified by Fuzzy Lookup are also being included in the Fuzzy Grouping.
When I run this in debug mode in BIDS, it shows the correct numbers as depicted in the illustration above. But after execution, when I query the tables, they show that all 1000 rows went through Fuzzy Grouping.
Any thoughts?
Btw, is there any way to upload attachments to the postings here?
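The split condition in the diagram can be written out as a small Python sketch, which may help when reconciling the counts: the two outputs should be disjoint and add up to the input. The function name split_lookup_results and the sample rows are invented for illustration; only the thresholds come from the post.

```python
def split_lookup_results(rows, min_similarity=0.60, min_confidence=0.85):
    """Partition rows the way the conditional split in the diagram does."""
    matches, remaining = [], []
    for row in rows:
        if row["_Similarity"] >= min_similarity and row["_Confidence"] >= min_confidence:
            matches.append(row)      # would go to Table1 as dupes
        else:
            remaining.append(row)    # would continue on to Fuzzy Grouping
    return matches, remaining


# Made-up rows covering both sides of the split.
rows = [
    {"id": 1, "_Similarity": 0.91, "_Confidence": 0.95},
    {"id": 2, "_Similarity": 0.70, "_Confidence": 0.50},
    {"id": 3, "_Similarity": 0.40, "_Confidence": 0.99},
]

matches, remaining = split_lookup_results(rows)
print(len(matches), len(remaining))  # prints 1 2
```

Comparing counts like these against the row counts actually found in the tables is one way to narrow down at which step the extra rows appear.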
Dec 14, 2006
I have an SSIS package where a small table of 270 rows is fuzzy looked up against a table on another SQL Server, and the matching records are inserted into a temporary table. This step takes more than 3 hours or so in debug mode and the package never gets beyond it. I have used an OLE DB destination to insert into the temporary table, and the temporary table never receives any rows.
Apr 7, 2008
Hi,
I need some advice on fuzzy lookup / grouping design.
I have a requirement that, I think, is between lookup and grouping transformations.
In one of our applications, users can enter manually a label for some information in the database.
Every month, I will store all the new data in our OLAP DB, and I want to group these labels with a fuzzy logic.
Historical data (already loaded) have to be grouped, as well as new data coming every month.
I have no predefined canonical data, so Fuzzy Lookup seems not adapted to my pb.
Fuzzy Grouping seems ok, but it would require to put historical data as well as new data as an input of the Fuzzy Grouping Transfo to constitute groups. This seems not efficient to me.
Any clue ?
M.D
Apr 10, 2008
I have a fuzzy lookup task that compares a source list of contacts to a reference list of contacts with the default settings. I did some testing by adding seed data that I knew would produce somewhat high similarity hits. All of the seeded contacts but one came back with the expected high sim values. When I looked for the one that didn't, I noticed another match had come up but it had a very low similarity of .17. I then did some research and discovered the reason was the MaxOutputMatchesPerInput setting which was set to 1. I then set it to 3 and reran the package and sure enough my seeded contact that was missing before now showed up. I thought the best match would show up if the MaxOutputMatches was set to 1? That is not the case in my testing.
For example, Donna Mizeman was in the reference list. I added Don Miseman to the source list to seed it. The only match that came back was something like Dieman Abdul .... So the initial match had a similarity of .17 but when MaxOutputMatchesPerInput is set to 3 the best match (seeded) has a similarity of .72.
Anyone have an explanation for this?
-Mike
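Going by the test described above, one possible workaround is to ask for several matches per input and then keep only the highest-scoring row for each input downstream. The Python sketch below shows that selection step only; the SourceId key and the function name best_match_per_input are hypothetical, and the sample rows reuse the names and scores from the post.

```python
def best_match_per_input(rows, key="SourceId", score="_Similarity"):
    """Keep the single highest-scoring row for each input key."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row[score] > current[score]:
            best[row[key]] = row
    return list(best.values())


# Output as it might look with MaxOutputMatchesPerInput set above 1.
rows = [
    {"SourceId": 1, "Source": "Don Miseman", "Match": "Dieman Abdul", "_Similarity": 0.17},
    {"SourceId": 1, "Source": "Don Miseman", "Match": "Donna Mizeman", "_Similarity": 0.72},
]

print(best_match_per_input(rows)[0]["Match"])  # prints Donna Mizeman
```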
Feb 26, 2008
Can anybody tell me when the index table is created? I have an OLE DB Command transformation that deletes rows from a table, and the pipeline continues on to a fuzzy lookup. The fuzzy lookup is returning matches to some of the rows which were deleted by the aforementioned OLE DB Command. If the index table is created during pre-execute, this would make some sense to me, since the rows that get deleted still exist before the pipeline reaches the OLE DB Command transformation. If this is the case, is there a way to delay the index table creation? If this is not the case, has anybody else run into anything like this, and is there a solution? Thanks for any help.
May 25, 2006
I am trying to create a package that reads an input file or input table, does a fuzzy lookup, and outputs results to another table. I was wondering if this can be done programmatically? I have tried adding the fuzzy lookup component like this:
IDTSComponentMetaData90 FuzzyLookupDF = dataFlow.ComponentMetaDataCollection.New();
FuzzyLookupDF.ComponentClassID = "Fuzzy Lookup";
FuzzyLookupDF.Name = "FuzzyLookup";
I'm wondering how I can change the properties, such as the input column, reference table, lookup column, etc. I am not even sure if this can be done; if it can, I'd like some guidance on which properties I would need to change.
Thanks!
amber
Mar 5, 2008
Greetings
My Fuzzy Lookup task works beautifully when it generates the lookup index every time it runs, but as I'm planning on running this hundreds of times I'd like it to maintain the index via the trigger. However when it attempts to install the trigger via sp_FuzzyLookupTableMaintenanceInstall I get:
Description: "A .NET Framework error occurred during execution of user-defined routine or aggregate "sp_FuzzyLookupTableMaintenanceInstall":
System.Data.SqlClient.SqlException:
Error number 8101 is invalid.
(I've not included the full stack trace as I figured this would be enough)
The table currently has an After Insert and After Update trigger. CLR integration is enabled in this database instance. Is there some other option I need to set somewhere?
Thanks!!
May 29, 2007
Hello,
For one of my SSIS projects that does a fuzzy lookup on a table, I opted to create an index and
to maintain the stored index. The index got created and subsequent project execution was able to
use that index.
Now I want to update certain rows in that table. When I run the update statement I get the following error.
How can I retain the index and still be able to update the table?
update location_stage set batchid = 'APR07N'
where batchid is null and eventid = '20070528020041';
Msg 6549, Level 16, State 1, Procedure sp_FuzzyLookupTableMaintenanceInvoke, Line 0
A .NET Framework error occurred during execution of user defined routine or aggregate 'sp_FuzzyLookupTableMaintenanceInvoke':
System.Data.SqlClient.SqlException: Transaction is not allowed to roll back inside a user defined routine, trigger or aggregate because the transaction is not started in that CLR level. Change application logic to enforce strict transaction nesting.
System.Data.SqlClient.SqlException:
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnectionSmi.ExecuteTransaction(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction)
at System.Data.SqlClient.SqlInternalTransaction.Rollback()
at System.Data.SqlClient.SqlTransaction.Rollback()
at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.TranWrap(DataCleaningOperation c)
. User transaction, if any, will be rolled back.
The statement has been terminated.
Mar 2, 2008
Hi All,
Is there a way the fuzzy lookup or grouping can be trained so that similarities and confidence values rely on previously matched strong links?
For example: I can link 80% of my two datasets using one strong identifier (say phone #) which I trust. My goal then, is to use the probability of matching of the rest of my linking fields (say Name,Address,Gender,DOB) in a "matched by phone number" pair to train a fuzzy lookup task to be done on the unlinked 20% of the datasets.
This "training set" would in theory influence the similarity and confidence values of the fuzzy output since each linking column would carry a different weight or contribution towards a confident match.
Does anyone out there know how to do this in practice in SSIS?
Jul 29, 2014
I have a fuzzy lookup in an Integration Services package that does not seem to run. I am pulling data from a table in SQL Server 2008 R2 and comparing the results to data from another table in SQL Server (same database and instance), using a fuzzy lookup to find similar matches between the data sets. When my data flow task reaches the fuzzy lookup, a DOS box pops up for a second and then my package finishes with a message of "Finished. Cancelled". The last message in my execution results displays: "Information: Execute phase is beginning". There are no Excel files being processed or used in this package. I've tried running the package in both 32-bit and 64-bit mode.
Nov 8, 2007
I am having trouble programmatically creating a fuzzy lookup package. I have successfully built 90% of it, along with a different Fuzzy Grouping package, but have hit a wall with regard to the pass-through columns of the fuzzy lookup component.
The last line of code below always fails. Prior to the code below, I've set up my fuzzy lookup component, instantiated it (the instance variable), and attached its input to the output of an OLE DB source. At this point, the only part that I haven't been able to figure out is the code below, where I'm trying to add pass-through columns to the output of my fuzzy lookup component. ImportId and ImportRowId are columns that are in my OLE DB source and thus in the input of my fuzzy component. Below I try to get them to pass through so that they're in the output, and the last line fails. When I step through the code, I see that outputColumn.LineageID is in fact the correct value (I compared it with a package I created manually, and the value when debugging is exactly the same as the value in the XML of the manually built version).
Code Block
IDTSOutput90 fuzzyLookupOutput = this.FuzzyLookup.OutputCollection["Fuzzy Lookup Output"];
IDTSOutput90 sourceOutputCollection = this.OleDbSource.OutputCollection["OLE DB Source Output"];
IDTSOutputColumnCollection90 sourceOutputCols = sourceOutputCollection.OutputColumnCollection;
foreach (IDTSOutputColumn90 outputColumn in sourceOutputCols)
{
    // pass through columns
    IDTSOutputColumn90 col = null;
    if (outputColumn.Name == "ImportId" || outputColumn.Name == "ImportRowId")
    {
        col = instance.InsertOutputColumnAt(
            fuzzyLookupOutput.ID, fuzzyLookupOutput.OutputColumnCollection.Count, outputColumn.Name, "");
        col.SetDataTypeProperties(
            outputColumn.DataType, outputColumn.Length, outputColumn.Precision, outputColumn.Scale, outputColumn.CodePage);
        instance.SetOutputColumnProperty(
            fuzzyLookupOutput.ID, col.ID, "SourceInputColumnLineageId", outputColumn.LineageID);
    }
}
Any thoughts???
Jul 11, 2007
Hi Everyone,
I'm building a package that is using the Fuzzy Lookup data flow item
and I'm receiving an error message that is troubling me!!!
"Fuzzy Lookup The length of input column is not equal to the length of
the reference column that it is being matched against"
If anyone can provide me any insight it would be greatly appreciated!
Regards,
A.Akin
Nov 9, 2007
Below is C# code used to create a FuzzyLookup SSIS package programmatically. It does 95% of what I need it to. The only thing missing that I cannot figure out is how to take a Fuzzy Lookup Input column (OLE DB Output Column) and make it "pass through" the fuzzy lookup component to the OLE DB Destination. In the example below, that means I need the QuarantinedEmployeeId to make it into the destination.
Look in the "Test Dependencies" region below to get instructions and scripts used to set assembly references, create the sample tables used for this example, and insert test data.
Can anyone help me get past this last hurdle? You will see at the end of my Fuzzy Lookup region a bunch of commented out code that I've used to try to accomplish this last problem.
Code Block
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
namespace CreateSsisPackage
{
public class TestFuzzyLookup
{
public static void Test()
{
            #region Test Dependencies
            // Assembly references:
            //   Microsoft.SqlServer.DTSPipelineWrap
            //   Microsoft.SQLServer.DTSRuntimeWrap
            //   Microsoft.SQLServer.ManagedDTS
            // First create a database called TestFuzzyLookup
            // Next, create tables:
            //SET ANSI_NULLS ON
            //GO
            //SET QUOTED_IDENTIFIER ON
            //GO
            //CREATE TABLE [dbo].[EmployeeMatch](
            // [RecordId] [int] IDENTITY(1,1) NOT NULL,
            // [EmployeeId] [int] NOT NULL,
            // [QuarantinedEmployeeId] [int] NOT NULL,
            // [_Similarity] [real] NOT NULL,
            // [_Confidence] [real] NOT NULL,
            // CONSTRAINT [PK_EmployeeMatch] PRIMARY KEY CLUSTERED
            //(
            // [RecordId] ASC
            //)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
            //) ON [PRIMARY]
            //GO
            //SET ANSI_NULLS ON
            //GO
            //SET QUOTED_IDENTIFIER ON
            //GO
            //CREATE TABLE [dbo].[QuarantinedEmployee](
            // [QuarantinedEmployeeId] [int] IDENTITY(1,1) NOT NULL,
            // [QuarantinedEmployeeName] [varchar](50) NOT NULL,
            // CONSTRAINT [PK_QuarantinedEmployee] PRIMARY KEY CLUSTERED
            //(
            // [QuarantinedEmployeeId] ASC
            //)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
            //) ON [PRIMARY]
            //GO
            //SET ANSI_NULLS ON
            //GO
            //SET QUOTED_IDENTIFIER ON
            //GO
            //CREATE TABLE [dbo].[Employee](
            // [EmployeeId] [int] IDENTITY(1,1) NOT NULL,
            // [EmployeeName] [varchar](50) NOT NULL,
            // CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED
            //(
            // [EmployeeId] ASC
            //)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
            //) ON [PRIMARY]
            // Next, insert test data
            //insert into employee values ('John Doe')
            //insert into employee values ('Jane Smith')
            //insert into employee values ('Ryan Johnson')
            //insert into quarantinedemployee values ('John Dole')
            #endregion Test Dependencies
            #region Create Package
            // Create a new package
            Package package = new Package();
            package.Name = "FuzzyLookupTest";
            // Add a Data Flow task
            TaskHost taskHost = package.Executables.Add("DTS.Pipeline") as TaskHost;
            taskHost.Name = "Fuzzy Lookup";
            IDTSPipeline90 pipeline = taskHost.InnerObject as MainPipe;
            // Get the pipeline's component metadata collection
            IDTSComponentMetaDataCollection90 componentMetadataCollection = pipeline.ComponentMetaDataCollection;
            #endregion Create Package
            #region Source
            // Add a new component metadata object to the data flow
            IDTSComponentMetaData90 oledbSourceMetadata = componentMetadataCollection.New();
            // Associate the component metadata object with the OLE DB Source Adapter
            oledbSourceMetadata.ComponentClassID = "DTSAdapter.OLEDBSource";
            // Instantiate the OLE DB Source adapter
            IDTSDesigntimeComponent90 oledbSourceComponent = oledbSourceMetadata.Instantiate();
            // Ask the component to set up its component metadata object
            oledbSourceComponent.ProvideComponentProperties();
            // Add an OLE DB connection manager
            ConnectionManager connectionManagerSource = package.Connections.Add("OLEDB");
            connectionManagerSource.Name = "OLEDBSource";
            // Set the connection string
            connectionManagerSource.ConnectionString = "Data Source=localhost;Initial Catalog=TestFuzzyLookup;Provider=SQLNCLI.1;Integrated Security=SSPI;Auto Translate=False;";
            // Set the connection manager as the OLE DB Source adapter's runtime connection
            IDTSRuntimeConnection90 runtimeConnectionSource = oledbSourceMetadata.RuntimeConnectionCollection["OleDbConnection"];
            runtimeConnectionSource.ConnectionManagerID = connectionManagerSource.ID;
            // Tell the OLE DB Source adapter to use the source table
            oledbSourceComponent.SetComponentProperty("OpenRowset", "QuarantinedEmployee");
            oledbSourceComponent.SetComponentProperty("AccessMode", 0);
            // Set up the connection manager object
            runtimeConnectionSource.ConnectionManager = DtsConvert.ToConnectionManager90(connectionManagerSource);
            // Establish the database connection
            oledbSourceComponent.AcquireConnections(null);
            // Set up the column metadata
            oledbSourceComponent.ReinitializeMetaData();
            // Release the database connection
            oledbSourceComponent.ReleaseConnections();
            // Release the connection manager
            runtimeConnectionSource.ReleaseConnectionManager();
            #endregion Source
            #region Fuzzy Lookup
            // Add a new component metadata object to the data flow
            IDTSComponentMetaData90 fuzzyLookupMetadata = componentMetadataCollection.New();
            // Associate the component metadata object with the Fuzzy Lookup object
            fuzzyLookupMetadata.ComponentClassID = "DTSTransform.BestMatch.1";
            // Instantiate
            IDTSDesigntimeComponent90 fuzzyLookupComponent = fuzzyLookupMetadata.Instantiate();
            // Ask the component to set up its component metadata object
            fuzzyLookupComponent.ProvideComponentProperties();
            // Add an OLE DB connection manager
            ConnectionManager connectionManagerFuzzy = package.Connections.Add("OLEDB");
            connectionManagerFuzzy.Name = "OLEDBFuzzy";
            // Set the connection string
            connectionManagerFuzzy.ConnectionString = "Data Source=localhost;Initial Catalog=TestFuzzyLookup;Provider=SQLNCLI.1;Integrated Security=SSPI;Auto Translate=False;";
            // Set the connection manager as the fuzzy lookup component's runtime connection
            IDTSRuntimeConnection90 runtimeConnectionFuzzy = fuzzyLookupMetadata.RuntimeConnectionCollection["OleDbConnection"];
            runtimeConnectionFuzzy.ConnectionManagerID = connectionManagerFuzzy.ID;
            // Set up the connection manager object
            runtimeConnectionFuzzy.ConnectionManager = DtsConvert.ToConnectionManager90(connectionManagerFuzzy);
            // Establish the database connection
            fuzzyLookupComponent.AcquireConnections(null);
            // Set up the external metadata column
            fuzzyLookupComponent.ReinitializeMetaData();
            // Release the database connection
            fuzzyLookupComponent.ReleaseConnections();
            // Release the connection manager
            runtimeConnectionFuzzy.ReleaseConnectionManager();
            // Get the standard output of the OLE DB Source adapter
            IDTSOutput90 oledbSourceOutput = oledbSourceMetadata.OutputCollection["OLE DB Source Output"];
            // Get the input of the Fuzzy Lookup component
            IDTSInput90 fuzzyInput = fuzzyLookupMetadata.InputCollection["Fuzzy Lookup Input"];
            // Create a new path object
            IDTSPath90 path = pipeline.PathCollection.New();
            // Connect the source to Fuzzy Lookup
            path.AttachPathAndPropagateNotifications(oledbSourceOutput, fuzzyInput);
            // Get the output column collection for the OLE DB Source adapter
            IDTSOutputColumnCollection90 oledbSourceOutputColumns = oledbSourceOutput.OutputColumnCollection;
            // Get the external metadata column collection for the fuzzy lookup component
            IDTSExternalMetadataColumnCollection90 externalMetadataColumns = fuzzyInput.ExternalMetadataColumnCollection;
            // Get the virtual input for the fuzzy lookup component
            IDTSVirtualInput90 virtualInput = fuzzyInput.GetVirtualInput();
            // Loop through output columns and relate columns that will be fuzzy matched on
            foreach (IDTSOutputColumn90 outputColumn in oledbSourceOutputColumns)
            {
                IDTSInputColumn90 col = fuzzyLookupComponent.SetUsageType(fuzzyInput.ID, virtualInput, outputColumn.LineageID, DTSUsageType.UT_READONLY);
                if (outputColumn.Name == "QuarantinedEmployeeName")
                {
                    // column name is one of the columns we'll match with
                    fuzzyLookupComponent.SetInputColumnProperty(fuzzyInput.ID, col.ID, "JoinToReferenceColumn", "EmployeeName");
                    fuzzyLookupComponent.SetInputColumnProperty(fuzzyInput.ID, col.ID, "MinSimilarity", 0.6m);
                    // set to be fuzzy match (not exact match)
                    fuzzyLookupComponent.SetInputColumnProperty(fuzzyInput.ID, col.ID, "JoinType", 2);
                }
            }
            fuzzyLookupComponent.SetComponentProperty("MatchIndexOptions", 1);
            fuzzyLookupComponent.SetComponentProperty("MaxOutputMatchesPerInput", 100);
            fuzzyLookupComponent.SetComponentProperty("ReferenceTableName", "Employee");
            fuzzyLookupComponent.SetComponentProperty("WarmCaches", true);
            fuzzyLookupComponent.SetComponentProperty("MinSimilarity", 0.6);
            IDTSOutput90 fuzzyLookupOutput = fuzzyLookupMetadata.OutputCollection["Fuzzy Lookup Output"];
            // add output columns that will simply pass through from the reference table (Employee)
            IDTSOutputColumn90 outCol = fuzzyLookupComponent.InsertOutputColumnAt(fuzzyLookupOutput.ID, 0, "EmployeeId", "");
            outCol.SetDataTypeProperties(Microsoft.SqlServer.Dts.Runtime.Wrapper.DataType.DT_I4, 0, 0, 0, 0);
            fuzzyLookupComponent.SetOutputColumnProperty(fuzzyLookupOutput.ID, outCol.ID, "CopyFromReferenceColumn", "EmployeeId");
            // add output columns that will simply pass through from the oledb source (QuarantinedEmployeeId)
            //IDTSOutput90 sourceOutputCollection = oledbSourceMetadata.OutputCollection["OLE DB Source Output"];
            //IDTSOutputColumnCollection90 sourceOutputCols = sourceOutputCollection.OutputColumnCollection;
            //foreach (IDTSOutputColumn90 outputColumn in sourceOutputCols)
            //{
            //    if (outputColumn.Name == "QuarantinedEmployeeId")
            //    {
            //        IDTSOutputColumn90 col = fuzzyLookupComponent.InsertOutputColumnAt(fuzzyLookupOutput.ID, 0, outputColumn.Name, "");
            //        col.SetDataTypeProperties(
            //            outputColumn.DataType, outputColumn.Length, outputColumn.Precision, outputColumn.Scale, outputColumn.CodePage);
            //        //fuzzyLookupComponent.SetOutputColumnProperty(
            //        //    fuzzyLookupOutput.ID, col.ID, "SourceInputColumnLineageId", outputColumn.LineageID);
            //    }
            //}
            // add output columns that will simply pass through from the oledb source (QuarantinedEmployeeId)
            //IDTSInput90 fuzzyInputCollection = fuzzyLookupMetadata.InputCollection["Fuzzy Lookup Input"];
            //IDTSInputColumnCollection90 fuzzyInputCols = fuzzyInputCollection.InputColumnCollection;
            //foreach (IDTSInputColumn90 inputColumn in fuzzyInputCols)
            //{
            //    if (inputColumn.Name == "QuarantinedEmployeeId")
            //    {
            //        IDTSOutputColumn90 col = fuzzyLookupComponent.InsertOutputColumnAt(fuzzyLookupOutput.ID, 0, inputColumn.Name, "");
            //        col.SetDataTypeProperties(
            //            inputColumn.DataType, inputColumn.Length, inputColumn.Precision, inputColumn.Scale, inputColumn.CodePage);
            //        fuzzyLookupComponent.SetOutputColumnProperty(
            //            fuzzyLookupOutput.ID, col.ID, "SourceInputColumnLineageId", inputColumn.LineageID);
            //    }
            //}
            #endregion Fuzzy Lookup
            #region Destination
            // Add a new component metadata object to the data flow
            IDTSComponentMetaData90 oledbDestinationMetadata = componentMetadataCollection.New();
            // Associate the component metadata object with the OLE DB Destination Adapter
            oledbDestinationMetadata.ComponentClassID = "DTSAdapter.OLEDBDestination";
            // Instantiate the OLE DB Destination adapter
            IDTSDesigntimeComponent90 oledbDestinationComponent = oledbDestinationMetadata.Instantiate();
            // Ask the component to set up its component metadata object
            oledbDestinationComponent.ProvideComponentProperties();
            // Add an OLE DB connection manager
            ConnectionManager connectionManagerDestination = package.Connections.Add("OLEDB");
            connectionManagerDestination.Name = "OLEDBDestination";
            // Set the connection string
            connectionManagerDestination.ConnectionString = "Data Source=localhost;Initial Catalog=TestFuzzyLookup;Provider=SQLNCLI.1;Integrated Security=SSPI;Auto Translate=False;";
            // Set the connection manager as the OLE DB Destination adapter's runtime connection
            IDTSRuntimeConnection90 runtimeConnectionDestination = oledbDestinationMetadata.RuntimeConnectionCollection["OleDbConnection"];
            runtimeConnectionDestination.ConnectionManagerID = connectionManagerDestination.ID;
            // Tell the OLE DB Destination adapter to use the destination table
            oledbDestinationComponent.SetComponentProperty("OpenRowset", "EmployeeMatch");
            oledbDestinationComponent.SetComponentProperty("AccessMode", 0);
            // Set up the connection manager object
            runtimeConnectionDestination.ConnectionManager = DtsConvert.ToConnectionManager90(connectionManagerDestination);
            // Establish the database connection
            oledbDestinationComponent.AcquireConnections(null);
            // Set up the external metadata column
            oledbDestinationComponent.ReinitializeMetaData();
            // Release the database connection
            oledbDestinationComponent.ReleaseConnections();
            // Release the connection manager
            runtimeConnectionDestination.ReleaseConnectionManager();
            // Get the standard output of the Fuzzy Lookup component
            IDTSOutput90 fuzzyLookupOutputCollection = fuzzyLookupMetadata.OutputCollection["Fuzzy Lookup Output"];
            // Get the input of the OLE DB Destination adapter
            IDTSInput90 oledbDestinationInput = oledbDestinationMetadata.InputCollection["OLE DB Destination Input"];
            // Create a new path object
            IDTSPath90 ssisPath = pipeline.PathCollection.New();
            // Connect the Fuzzy Lookup output to the OLE DB Destination adapter
            ssisPath.AttachPathAndPropagateNotifications(fuzzyLookupOutputCollection, oledbDestinationInput);
            // Get the output column collection for the Fuzzy Lookup component
            IDTSOutputColumnCollection90 fuzzyLookupOutputColumns = fuzzyLookupOutputCollection.OutputColumnCollection;
            // Get the external metadata column collection for the OLE DB Destination adapter
            IDTSExternalMetadataColumnCollection90 externalMetadataCols = oledbDestinationInput.ExternalMetadataColumnCollection;
            // Get the virtual input for the OLE DB Destination adapter.
            IDTSVirtualInput90 vInput = oledbDestinationInput.GetVirtualInput();
            // Loop through the Fuzzy Lookup output columns
            foreach (IDTSOutputColumn90 outputColumn in fuzzyLookupOutputColumns)
            {
                // Add a new input column
                IDTSInputColumn90 inputColumn = oledbDestinationComponent.SetUsageType(oledbDestinationInput.ID,
                    vInput, outputColumn.LineageID, DTSUsageType.UT_READONLY);
                // Get the external metadata column from the OLE DB Destination
                // using the output column's name
                IDTSExternalMetadataColumn90 externalMetadataColumn = externalMetadataCols[outputColumn.Name];
                // Map the new input column to its corresponding external metadata column.
                oledbDestinationComponent.MapInputColumn(oledbDestinationInput.ID, inputColumn.ID, externalMetadataColumn.ID);
            }
            #endregion Destination
            // Save the package
            Application application = new Application();
            application.SaveToXml(@"c:\Temp\TestFuzzyLookup.dtsx", package, null);
        }
    }
}