We have an SSIS package that includes a Fuzzy Grouping transformation in its Data Flow. It takes two columns from a source table and saves the output, including match scores, to a different table. This is how we are doing it:
1. Load required data from table using OLEDB connection (source)
2. Sort the data
3. Apply Fuzzy Grouping (using a dedicated database instead of tempdb, with MinSimilarity = 0.6)
4. Send to destination table using OLEDB connection (destination)
The input table has millions of records. The package takes too long to execute and sometimes even fails after running for 12 hours. Any suggestions for improving performance are welcome.
I managed to get Fuzzy Grouping working. The relevant output columns (_key_in and _key_out) are stored in a new table that is a copy of the old table plus the fuzzy grouping columns.
How do I get SSIS to store _key_in and _key_out in the original table? The new matching column _key_out refers to the new key, _key_in. How can I get SSIS to translate that into a matching column that refers to my original key?
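The translation being asked about can be sketched outside SSIS. This is only an illustration in Python, assuming the original key was included as a pass-through column so that every output row carries both its original key and the generated `_key_in`. The column name `customer_id` and the sample rows are hypothetical.

```python
# Hypothetical Fuzzy Grouping output: the original key (customer_id) was
# passed through, and _key_in / _key_out were added by the transformation.
rows = [
    {"customer_id": 501, "_key_in": 1, "_key_out": 1},
    {"customer_id": 502, "_key_in": 2, "_key_out": 1},
    {"customer_id": 777, "_key_in": 3, "_key_out": 3},
]


def translate_keys(rows, original_key="customer_id"):
    """Map each original key to the original key of its group's canonical row."""
    # Step 1: map every generated _key_in back to the original key.
    key_in_to_original = {row["_key_in"]: row[original_key] for row in rows}
    # Step 2: translate _key_out through that map.
    return {
        row[original_key]: key_in_to_original[row["_key_out"]]
        for row in rows
    }


matches = translate_keys(rows)
print(matches)
```

In SQL terms the same idea would be a self-join of the output table on `_key_out = _key_in`, followed by an update of the original table with the result.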
Hi - we have been evaluating Fuzzy Grouping and Fuzzy Lookup for maintaining our large list of customer records. Initial testing with Fuzzy Grouping on about 300K records went great, but now with a larger sample of 7.3 million records we are running into problems. It doesn't appear to be a system limitation - the index is built reasonably quickly and without errors, but when the matching starts we get these errors:
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: The ProcessInput method on component "Fuzzy Lookup" (86) failed with error code 0x8000FFFF. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: Thread "WorkThread0" has exited with error code 0x8000FFFF.
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: Thread "WorkThread1" received a shutdown signal and is terminating. The user requested a shutdown, or an error in another thread is causing the pipeline to shutdown.
[Fuzzy Grouping Inner Data Flow : OLE DB Source [1]] Error: The attempt to add a row to the Data Flow task buffer failed with error code 0xC0047020.
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: Thread "WorkThread1" has exited with error code 0xC0047039.
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: The PrimeOutput method on component "OLE DB Source" (1) returned error code 0xC02020C4. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.
[Fuzzy Grouping Inner Data Flow : DTS.Pipeline] Error: Thread "SourceThread0" has exited with error code 0xC0047038.
One thing we did find is that our test server didn't have SP1 installed, and installing it seemed to help a lot (we were getting buffer errors prior to SP1). One other note - the destination table is populated with all the data, but no scoring has been applied to it.
Does anyone have any ideas what could be causing this?
I have an Oracle table called "Party" which has Party_Id as its primary key and Party_Name, Party_Addr, etc. as fields. The table contains many duplicate party details (party_name and party_addr). We are trying to avoid duplicates using the fuzzy logic in SSIS.
1. Can anybody suggest how to create a package that avoids duplicates using fuzzy logic for this scenario? (Step-by-step instructions would help me understand SSIS.)
2. Could you please provide some samples for the fuzzy transformations? (Please send a sample to my email.)
I was running a Fuzzy Grouping task on SQL Server Enterprise Edition SP1 without any issues. I then applied SP2 and now that same Fuzzy Grouping is causing a minidump and terminating the process.
First, does anybody know anything about this kind of issue?
Second, I tried to run the minidump file in Visual Studio but I cannot actually run the dump file in Visual Studio as I keep getting the following exception:
Debugging information for 'DtsDebugHost.exe' cannot be found or does not match. No symbols loaded.
Finally, I did obtain a random error on the server itself that displayed the GUID: 58FC39EB-9DBD-4EA7-B7B4-9404CC6ACFAB.
This GUID appears to be tied to a Dr. Watson error but, again, I cannot figure out what process is breaking.
We do not have any Address Cleansing tools and the requirement is we have to cleanse the data, finding the best possible record which has all info and update other records accordingly.
I am not sure whether we can do this with the Fuzzy Grouping transformation.
I have been struggling with this for quite a while, so any help would be appreciated.
I need to know if there is a way to populate the Fuzzy Grouping control dynamically. I know you can programmatically design a package and customize it in C#, but for our purposes we would like to control the SSIS package via database settings. When the settings change, the package would then behave differently. It's a simple package consisting of an input - fuzzy grouping - conditional split - output. The connections are set up dynamically using parameters, expressions and a script task. Is there any way I could do a similar thing for Fuzzy Grouping?
I have a few questions about the amount of resources used by the Fuzzy Grouping transformation. I am running a little less than 5 million records through a Fuzzy Grouping that exact-matches one column and fuzzy-matches another. The server executing the package is a dual-core Xeon with 2 GB of RAM, running a default instance of SQL Server 2005 Enterprise.
I have been attempting to execute this package for a while now, but it keeps erroring out for various reasons. At first, it was from a lack of available memory. I limited the memory usage of SQL Server to 256 MB and set the buffer temp storage path, which alleviated those errors. However, my tempdb transaction log is now growing significantly. It failed once because it could not grow and reallocate quickly enough, but enlarging the auto-growth factor fixed that. Then it filled up the volume the tempdb log was on, so I have now moved it to the SAN and am about to try again.
I was wondering, does anyone have a general idea on approximate resource usage by fuzzy grouping? Specifically, is there an approximate relation between the number of records grouped and the amount of ram/pagefile required? Also, on the database backend, how big can I expect the tempdb data/log files to get?
I need some advice on fuzzy lookup / grouping design. I have a requirement that, I think, is between lookup and grouping transformations.
In one of our applications, users can manually enter a label for some information in the database. Every month, I will store all the new data in our OLAP DB, and I want to group these labels using fuzzy logic. Historical data (already loaded) has to be grouped, as well as the new data coming in every month.
I have no predefined canonical data, so Fuzzy Lookup does not seem suited to my problem. Fuzzy Grouping seems OK, but it would require feeding the historical data as well as the new data into the Fuzzy Grouping transformation to form the groups. This does not seem efficient to me.
My question is how to calculate the similarity using a SQL query, for example with LIKE %, ORDER BY and so on. I am writing a function that works like Fuzzy Grouping, but I do not know how it arrives at its answer, that is, how it decides that the selected rows of data match.
I hope my question is clear. How do I write the correct query? What should I do? I'm a newbie in Integration Services, so I need a step-by-step explanation if anything needs correcting.
I am looking forward to hearing from you shortly and thanks a lot in advance.
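To make the idea of a similarity score concrete, here is a minimal sketch in Python. It is not the algorithm that Fuzzy Grouping uses. It only shows what a score between 0 and 1 and a minimum-similarity threshold look like, using the standard library's `difflib.SequenceMatcher`. The helper names, the sample values and the 0.6 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical after case/whitespace normalisation."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()


def best_matches(value, candidates, min_similarity=0.6):
    """Return (candidate, score) pairs at or above the threshold, best first."""
    scored = [(candidate, similarity(value, candidate)) for candidate in candidates]
    kept = [(candidate, score) for candidate, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for candidate, score in best_matches("Jon Smith", ["John Smith", "Jane Smythe", "Bob Jones"]):
    print(f"{candidate}: {score:.2f}")
```

A plain `LIKE` pattern cannot produce a graded score like this, which is why a scoring function (in SQL, CLR or client code) is usually needed for this kind of matching.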
I have recently decided to dedupe my data, but after running Fuzzy Grouping I am having a problem with the update query that decides which duplicate to keep.
_key_in is unique and _key_out identifies the duplicate group, so for example:
_key_in , _key_out , name , score , dedupe
1 , 1 , ron , 10 , purge
2 , 1 , ronn , 15 , keep
3 , 3 , john , 5 , keep
4 , 4 , matt , 15 , keep
5 , 4 , mat , 10 , purge
6 , 4 , matt , 15 , purge
Within each _key_out group I want to keep the row with the highest score by setting the field de_dupe to 'keep' and the remainder to 'purge'. Scores can also be equal within a duplicate group; in that case I just need to keep one, and it doesn't matter which. The query I have below nearly works, but it marks all duplicates that share the top score as 'keep'.
Code: UPDATE b SET b.dedupe_result = 'keep' FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] b INNER JOIN
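The rule described above (one 'keep' per `_key_out` group, highest score wins, ties broken arbitrarily) can be sketched as follows. This is a Python illustration of the logic only, not a completion of the truncated query. The tie-break on the lowest `_key_in` is an assumption, chosen so that exactly one row per group is kept.

```python
from collections import defaultdict

# Sample rows taken from the example table above.
rows = [
    {"_key_in": 1, "_key_out": 1, "name": "ron", "score": 10},
    {"_key_in": 2, "_key_out": 1, "name": "ronn", "score": 15},
    {"_key_in": 3, "_key_out": 3, "name": "john", "score": 5},
    {"_key_in": 4, "_key_out": 4, "name": "matt", "score": 15},
    {"_key_in": 5, "_key_out": 4, "name": "mat", "score": 10},
    {"_key_in": 6, "_key_out": 4, "name": "matt", "score": 15},
]


def mark_keep_or_purge(rows):
    """Return {_key_in: 'keep' | 'purge'} with exactly one 'keep' per _key_out."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["_key_out"]].append(row)

    result = {}
    for members in groups.values():
        # Highest score wins; on a tie the lowest _key_in wins (assumed rule).
        winner = max(members, key=lambda r: (r["score"], -r["_key_in"]))
        for row in members:
            result[row["_key_in"]] = "keep" if row is winner else "purge"
    return result


dedupe = mark_keep_or_purge(rows)
print(dedupe)
```

In T-SQL the same effect is usually achieved by numbering the rows with `ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY score DESC, _key_in)` and keeping only row number 1, which avoids the tie problem that a comparison against `MAX(score)` runs into.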
I've seen one other post on this topic from October 2005 and I thought I'd bring it up again. I've a Fuzzy Grouping component in my data flow. The output data from it appears to be the result of records spliced into other records. This includes pass-through columns, not merely "clean" or similarity columns. For example (I've added the suffixes for illustrative purposes):
I was wondering how Fuzzy Grouping deals with and handles first name similarities. Is there a way to configure it so that Anthony = Tony, Bill = William, etc.? I created a simple package with several rows containing similar first names and ran the fuzzy grouping on the first name column. I received only one possible duplicate of Will = William which was at 56%. I lowered the threshold down to 1% and still only one match.
Now I understand and appreciate the reasons for this but was wondering if this type of situation was considered and a way of dealing with it is available.
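One possible workaround, sketched here in Python rather than as an SSIS feature, is to normalise first names against a hand-maintained nickname table before the grouping step, so that the fuzzy comparison only ever sees the canonical form. The `NICKNAMES` entries and function name below are hypothetical examples, not a shipped dictionary.

```python
# Hypothetical, hand-maintained nickname table: nickname -> canonical name.
NICKNAMES = {
    "tony": "anthony",
    "bill": "william",
    "will": "william",
}


def canonical_first_name(name: str) -> str:
    """Lower-case and trim the name, then replace a known nickname."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


names = ["Anthony", "Tony", "Bill", "William", "Will", "Maria"]

# Group the raw names by their canonical form.
groups = {}
for name in names:
    groups.setdefault(canonical_first_name(name), []).append(name)

print(groups)
```

In a package, the equivalent would be a lookup against a nickname table that adds a canonical-name column ahead of the grouping, with the grouping then run on that column.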
Is there a way the fuzzy lookup or grouping can be trained so that similarities and confidence values rely on previously matched strong links?
For example: I can link 80% of my two datasets using one strong identifier (say phone #) which I trust. My goal then, is to use the probability of matching of the rest of my linking fields (say Name,Address,Gender,DOB) in a "matched by phone number" pair to train a fuzzy lookup task to be done on the unlinked 20% of the datasets.
This "training set" would in theory influence the similarity and confidence values of the fuzzy output since each linking column would carry a different weight or contribution towards a confident match.
Does anyone out there know how to do this in practice in SSIS?
I have tried to process more than 3 million records through Fuzzy Grouping on two different servers with no success. 3 million works, but anything above 4 million doesn't. Some background:
We are trying to de-dup our customer table on: name (.5 min), address1 (.5 min), city (.5 min), state (exact). .8 overall record min score.
Output includes additional fields: customerid, sourceid, address2, country, phonenumber
Without SP1 installed I couldn't even get a few hundred thousand records to process
Two different servers - same problems. Note that SSIS and SQL Server are running locally on both
The higher end server has 4 GB RAM, the other 2.5 GB RAM. Plenty of free disk space on both
SQL Server is configured to use 2 GB of RAM max
The page file is currently at 15 GB
After running a number of tests on both servers, trying different batch sizes etc., the one thing I noticed is that it always seems to error out when SSIS takes over and starts chewing up all the available RAM. This happens after the index is created and SSIS starts "warming caches". On both servers SQL Server uses about 1.6 GB of RAM at this point, while SSIS keeps taking RAM until all physical RAM is used up.
Some questions:
Has anyone been able to process more than 3 million records, and if so, what is your hardware configuration?
Should we try running SSIS from a different server so it has access to the full amount of physical RAM (so it doesn't have to fight for RAM with SQL Server)?
Should we install Win 2003 Enterprise Server so we can add more RAM?
Any ideas why switching to the page file might be causing errors?
Will the fuzzy grouping task match a null value to an empty string (or spaces)? I've got 5 columns I'm matching on, and one of them may be null for certain rows but an empty string for others. Given the 4 other columns may match, will this difference stop similar columns being grouped together?
(Someone's modified my grouped data since it was deduped, which takes a while, and I'm hoping for a quick answer on this).
I have a table in which I need to identify similarities, so I'm running a Fuzzy Grouping process. I'm getting the following errors and I can't identify the problem, since all the fields are varchar except for the first, which is int but is not used in the fuzzy matching.
select MSSEndCustomerTPID, orgname, address1, cityname, statename, countryname
from [sales].[vw_Fact_VolumeSales] a
inner join [GMOFBI].[dbo].[vw_Dim_MSS_Organization] b
    on a.EndCustomerOrganizationKey = b.MSSOrganizationKey
Hello, I am just wondering if someone out there has tried fuzzy matching on large-scale databases, i.e. about 20 million contact records. Suppose I wanted to perform matching/grouping for 10,000 incoming messages. How long does this usually take? How does it depend on the number of fields chosen for the match?
We did some "at scale" fuzzy lookup tests today and were rather disappointed with the performance. I'm wanting to know your experience so I can set my performance expectations appropriately.
We were doing a fuzzy lookup against a lookup table with 25 million rows. Each row has 11 columns used in the fuzzy lookup, each between 10-100 chars. We set CopyReferenceTable=0 and MatchIndexOptions=GenerateAndPersistNewIndex and WarmCaches=true. It took about 60 minutes to build that index table, during which, dtexec got up to 4.5GB memory usage. (Is there a way to tell what % of the index table got cached in memory? Memory kept rising as each "Finished building X% of fuzzy index" progress event scrolled by all the way up to 100% progress when it peaked at 4.5GB.) The MaxMemoryUsage setting we left blank so it would use as much as possible on this 64-bit box with 16GB of memory (but only about 4GB was available for SSIS).
After it got done building the index table, it started flowing data through the pipeline. We saw the first buffer of ~9,000 rows get passed from the source to the fuzzy lookup transform. Six hours later it had not finished doing the fuzzy lookup on that first buffer! Running Profiler showed us it was firing off lots of singleton SQL queries doing lookups, as expected. So it was making progress, just very, very slowly.
We had set MinSimilarity=0.45 and Exhaustive=False. Those seemed to be reasonable settings for smaller datasets.
Does that performance seem in line with expectations? Any thoughts on how to improve performance?
I've been looking into ways to accomplish a fuzzy search, and SSIS makes that possible if I want to do a bulk import or something like it. But what if I just want to look things up at any given time without having to run the package?
Is it possible to expose the fuzzy lookup outside of SSIS to, for example, T-SQL?
Here's an example: I want to lookup the music artist "Notorious BIG" but in the database it is "Notorious B.I.G." if I use the SSIS fuzzy lookup I basically get what I'm looking for. But how would I call this from a web application? So then I tried Full text search but this doesn't really work out as well.
Will I have to re-write the logic that the fuzzy lookup uses to enable it to work? i.e. using Full Text Indexes and FreeTextTable, ContainsTable, SoundEx and the like to somewhat even come close to what the Fuzzy Lookup has?
I'm really stumped on this one. I'm a self-taught SQL guy, so there is probably something I'm overlooking.
I'm trying to get information like this in to a report:
WO#
-WO Line #
--(Details)
--Work Order Line Detail #1
--Work Order Line Detail #2
--Work Order Line Detail #3
--Work Order Line Detail #etc
--(Parts)
--Work Order Line Parts #1
--Work Order Line Parts #2
--Work Order Line Parts #etc
WO#
-WO Line #
--(Details)
--Work Order Line Detail #1
--Work Order Line Detail #2
--Work Order Line Detail #3
--Work Order Line Detail #etc
--(Parts)
--Work Order Line Parts #1
--Work Order Line Parts #2
--Work Order Line Parts #etc
I'm unable to get the grouping right on this. Since the line details and line parts both are children of the line #, how do you do "parallel groups"?
There are 4 tables:
Work Order Header
Work Order Line
Work Order Line Details
Work Order Line Requisitions
The Header has a unique PK. The Line uses the Header and a Line # as foreign keys that together are unique. The Detail and Requisition tables use the Header and Line #s in addition to their own line number as keys.
It probably isn't best practice, but I'm kind of new so I need some guidance. I'd really appreciate any help! Here's my query:
SELECT [Work Order Header].No_ AS WO_No,
       [Work Order Line].[Line No_] AS WOL_No,
       [Work Order Requisition].[Line No_] AS WOLR_No,
       [Work Order Line Detail].[Line No_] AS WOLD_No
FROM [Work Order Header]
LEFT OUTER JOIN [Work Order Line]
    ON [Work Order Header].No_ = [Work Order Line].[Work Order No_]
LEFT OUTER JOIN [Work Order Line Detail]
    ON [Work Order Line].[Work Order No_] = [Work Order Line Detail].[Work Order No_]
   AND [Work Order Line].[Line No_] = [Work Order Line Detail].[Work Order Line No_]
LEFT OUTER JOIN [Work Order Requisition]
    ON [Work Order Line].[Work Order No_] = [Work Order Requisition].[Work Order No_]
   AND [Work Order Line].[Line No_] = [Work Order Requisition].[Work Order Line No_]
Is there a built-in capability in SQL Server 2005 to do a search that can handle spelling errors? For example, we search for "hanovr" and our database contains "hanover". When there is a spelling error, searching with LIKE, CONTAINS or FREETEXT does not give me the results. Is there an out-of-the-box solution for this problem? Please advise.
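For reference, the kind of matching needed for "hanovr" versus "hanover" is usually described in terms of edit distance. The sketch below is plain Python, not a SQL Server feature. It implements the standard Levenshtein distance and filters a hypothetical list of values by a maximum distance; the function names and sample data are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def spelling_tolerant_search(term, values, max_distance=1):
    """Return the values within max_distance edits of term (case-insensitive)."""
    term = term.lower()
    return [value for value in values if edit_distance(term, value.lower()) <= max_distance]


cities = ["hanover", "hamburg", "hannover", "dover"]
found = spelling_tolerant_search("hanovr", cities)
print(found)
```

This compares the search term against every value, so on a large table it would need some pre-filtering (for example on length or first letter) to be practical.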
How do I do a fuzzy search? If I have a table of full names, I'd like the user to be able to do a search and find the record "Charles Montgomery Burns" with "Monty Burns" or "Montgomry" (misspelling).
Every major web site does this kind of thing (Amazon, Google, etc).
Someone suggested SOUNDEX, but this really doesn't fit the bill. Misspellings often don't use the same sound signature as the originals. Plus, that doesn't handle multi-word searchable texts very well.
Others have suggested tries or suffix trees. If I went this route, wouldn't I have to preload all data out of the database and into this custom structure upon app startup? Is there any way around that? Also, this solution seems like it would require a lot of dev time (building a custom suffix tree with fuzzy lookup capabilities).
Is there a commonly known and acceptable solution to this?
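As a rough illustration of a middle ground between SOUNDEX and a custom suffix tree, the sketch below scores each word of the query against the words of each stored name and averages the best per-word scores. It is written in Python with the standard library's `difflib`. The function names, the 0.75 threshold and the sample names are assumptions, and the approach scans every name, so it is a sketch of the matching idea rather than an indexed solution.

```python
from difflib import SequenceMatcher


def token_score(query: str, full_name: str) -> float:
    """Average, over query words, of the best similarity to any word in full_name."""
    query_words = query.lower().split()
    name_words = full_name.lower().split()
    if not query_words or not name_words:
        return 0.0
    best_per_word = [
        max(SequenceMatcher(None, q, w).ratio() for w in name_words)
        for q in query_words
    ]
    return sum(best_per_word) / len(best_per_word)


def search(query, names, threshold=0.75):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, token_score(query, name)) for name in names]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


names = ["Charles Montgomery Burns", "Waylon Smithers", "Homer Simpson"]
print(search("Monty Burns", names))
print(search("Montgomry", names))
```

Matching word by word is what lets a partial query such as "Monty Burns" find a three-word name, which a whole-string comparison would penalise heavily.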
(sorry, also posted to MySQL group; I'm using both databases so a solution in either would be satisfactory)
I am using a Fuzzy Lookup to cleanse data from a sales line details table during the import process. The sales order line details contain a field called 'reference', which is compared to a field called 'category' in another table. Using data viewers to check the cleansing process, I notice that the Fuzzy Lookup doesn't always match as expected. For example:
tbl.salesline.reference = 'I3' -> tbl.sales.category = 'I03'
The above is OK, but the lookup also returns the following:
tbl.salesline.reference = 'I9' -> tbl.sales.category = 'I01'
The value I9 doesn't exist; it was miskeyed by the user and should have been 'I99'. I would have expected the Fuzzy Lookup to pick up the I99 value, as at least two of the characters match, but instead it picks the first 'I*' in the table. If I expand the Fuzzy Lookup to return more results, e.g. 5 per record, it returns the first 5 results: I01, I02, I03 and so on. Is there a way of improving the Fuzzy Lookup itself?
Hi all, I have been trying for a while now to clean some data that contains duplicates using Fuzzy Grouping. I can get as far as identifying the duplicate data using Fuzzy Grouping, but how do I get it out so I can insert the non-duplicate data into a dimension table, table1?
What I am also stuck on is how to get the data that isn't duplicated into table1 as well, or is this done in the same step? Please help, deadlines are creeping up on me.
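The usual way to read the Fuzzy Grouping output is that the row whose `_key_in` equals its `_key_out` is the canonical row of its group, and a row with no duplicates forms a group of one, so it also satisfies that condition. A minimal Python sketch of that split follows; the sample rows are hypothetical.

```python
# Hypothetical Fuzzy Grouping output rows.
rows = [
    {"_key_in": 1, "_key_out": 1, "name": "Acme Ltd"},
    {"_key_in": 2, "_key_out": 1, "name": "Acme Limited"},
    {"_key_in": 3, "_key_out": 3, "name": "Globex"},
]


def split_canonical(rows):
    """Split rows into (canonical, duplicates) by comparing _key_in with _key_out."""
    canonical = [row for row in rows if row["_key_in"] == row["_key_out"]]
    duplicates = [row for row in rows if row["_key_in"] != row["_key_out"]]
    return canonical, duplicates


canonical, duplicates = split_canonical(rows)
print([row["name"] for row in canonical])
print([row["name"] for row in duplicates])
```

In a data flow the same test is typically placed in a Conditional Split after Fuzzy Grouping, with the `_key_in == _key_out` output routed to the dimension table, which covers the unique rows and one representative per duplicate group in a single step.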
The Enterprise Edition of SQL Server includes some advanced BI features, for example the Fuzzy Lookup feature of IS. If the IS package lives on an Enterprise Edition of SQL Server and the database the package targets lives on a Standard Edition of SQL Server, can the advanced features be used? Can you run a Fuzzy Lookup against a database on a Standard Edition of SQL Server when the IS package lives on an Enterprise Edition of SQL Server? Thanks!