Hello,
I'm trying to clean my data using the Fuzzy Lookup algorithm through SSIS, but I get null values everywhere. This is what I did:
I applied the Fuzzy Lookup to a table (tblValues). As the source table I have tblValues, and as the reference table in Fuzzy Lookup I have tblValues as well, which results in null values in all fields/columns.
Do I have to create my own reference table? If yes, how do I do that, and what values should it contain? I don't understand what the reference table must look like for the algorithm to work. Any suggestions?
I am trying to implement the Fuzzy Lookup transformation in my SSIS package. However, I want to understand the basic logic behind this component. What is the algorithm used here, and how does it work (in simple language)?
I have a requirement in the application. I have a table with contents in the following format:

Col1  Col2  Col3
1     A1    A2
1     A1    B2
2     B1    C2
1     A1    D3

What I require is: for any rows where the 'Col1' and 'Col2' values are the same, the query should show only one record, with the 'Col3' values concatenated as a comma-separated list. For the above example the result should be:

Col1  Col2  Col3
1     A1    A2, B2, D3
2     B1    C2
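In case a sketch helps: on SQL Server 2005 the usual way to build the comma-separated list is a correlated FOR XML PATH('') subquery, with STUFF dropping the leading separator. This assumes the table is called tblData (the post doesn't name it):

SELECT t.Col1, t.Col2,
       STUFF((SELECT ', ' + t2.Col3
              FROM tblData t2
              WHERE t2.Col1 = t.Col1 AND t2.Col2 = t.Col2
              ORDER BY t2.Col3
              FOR XML PATH('')), 1, 2, '') AS Col3
FROM tblData t
GROUP BY t.Col1, t.Col2

(On SQL Server 2017+ the same result is a one-liner with STRING_AGG.)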
I have this code below and it appears to work, but I wondered if there was a better way.
Basically I have a table with some sales. The amount sold is stored in a varchar (for a very stupid reason) called ticketNo; the date of the sale is stored in txDate (thankfully this is a date).
We need to report out the total amount sold for a period in weekly groupings with the week starting on Monday. If a week starts within the period specified but does not finish we extend the period to include the whole week.
As I said, the code below appears to work (no, I have not tested it extensively), but it looks like it will be pretty resource intensive.
declare @minTxDate DateTime
declare @maxTxDate DateTime
declare @StartOfYear DateTime
declare @LocalMaxTxDate DateTime
declare @WeekDay int
declare @StartWeek int
declare @EndWeek int

-- Monday as the first day of the week; set this before any week arithmetic
set DATEFIRST 1

set @minTxDate = '02-Jul-2007'
set @maxTxDate = '31-Aug-2007'
set @StartOfYear = '01-Jan-' + Convert(varchar(4), Year(getDate()))
set @WeekDay = DatePart(weekday, @StartOfYear)
set @StartWeek = DatePart(week, @minTxDate)
set @EndWeek = DatePart(week, @maxTxDate)
-- pad the upper bound so a week that starts inside the period is fully included
set @LocalMaxTxDate = DateAdd(d, 10, @maxTxDate)

select DateAdd(d, ((DatePart(week, tmpSale.txDate) - 1) * 7) + @WeekDay - 1, @StartOfYear) 'FirstDayOfWeek',
       DatePart(week, tmpSale.txDate) 'WeekOfSale',
       COALESCE(sum(Convert(decimal, ticketNo)), 0) 'TonnesSold'
from (select Sale.txDate, Sale.ticketNo
      from Sale
      where Sale.txDate between @minTxDate and @LocalMaxTxDate) tmpSale
where DatePart(week, tmpSale.txDate) between @StartWeek and @EndWeek
group by DatePart(week, tmpSale.txDate),
         DateAdd(d, ((DatePart(week, tmpSale.txDate) - 1) * 7) + @WeekDay - 1, @StartOfYear)
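For what it's worth, a lighter-weight way to group on "the Monday that starts the week" avoids DATEPART(week) and DATEFIRST altogether by flooring each date to a whole number of weeks from day 0 (1900-01-01, which is a Monday). A minimal sketch against the same Sale table, assuming the same @minTxDate/@maxTxDate variables:

-- Floor any date to the Monday of its week: day 0 is a Monday,
-- so integer-dividing the day count by 7 snaps to week starts.
declare @WeekStartMin DateTime
declare @WeekEndMax DateTime
set @WeekStartMin = DateAdd(d, DateDiff(d, 0, @minTxDate) / 7 * 7, 0)        -- Monday of the first week
set @WeekEndMax   = DateAdd(d, DateDiff(d, 0, @maxTxDate) / 7 * 7 + 7, 0)    -- exclusive end of the last week

select DateAdd(d, DateDiff(d, 0, s.txDate) / 7 * 7, 0) 'FirstDayOfWeek',
       sum(Convert(decimal, s.ticketNo)) 'TonnesSold'
from Sale s
where s.txDate >= @WeekStartMin
  and s.txDate <  @WeekEndMax
group by DateAdd(d, DateDiff(d, 0, s.txDate) / 7 * 7, 0)

The sargable WHERE clause also lets an index on txDate do the range work instead of filtering on a computed week number.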
How could I achieve a ranking number where I have some grouping involved? I want to have the groups in sets of 4, but when I get to a new person the ranking increases, even if I haven't filled the group. For example:
Name        Year  Sales  Ranking
----------  ----  -----  -------
John Smith  2009  1296   1
John Smith  2010  1296   1
John Smith  2011  1296   1
John Smith  2012  1296   1
John Smith  2013  1296   2
[code]....
So, when I get to John's 5th year, the Rank increases. I then have only two more rows before I get to Nina, but the Rank number still goes up, as the details are for a different person from the prior one.
Select *,
       DENSE_RANK() over (Order by Name)
       + ((RANK() over (Partition by Name Order by Year) - 1) / 4) as Ranking
from #Test
Order by Name, Year

Drop Table #Test

This works for John, but Nina then has a Ranking of 2 for her only year, whereas she should be 3. Steve also then has a value one too low for all of his ranks.
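If it helps, one way to keep the block numbers consecutive across people is to compute each person's block of 4 first, then rank over the (Name, block) pair. A sketch against the same #Test table:

;with Blocks as
(
    select *,
           (ROW_NUMBER() over (partition by Name order by Year) - 1) / 4 as Blk
    from #Test
)
select Name, Year, Sales,
       DENSE_RANK() over (order by Name, Blk) as Ranking
from Blocks
order by Name, Year

With this, John's fifth year lands in block 1 and gets Ranking 2, Nina's single year becomes Ranking 3, and Steve's blocks continue from there.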
I've been looking into ways to accomplish a fuzzy search, and SSIS makes that possible if I want to do a bulk import or something like it. But what if I just want to look stuff up at any given time without having to run the package?
Is it possible to expose the fuzzy lookup outside of SSIS, for example to T-SQL?
Here's an example: I want to look up the music artist "Notorious BIG", but in the database it is "Notorious B.I.G.". If I use the SSIS fuzzy lookup I basically get what I'm looking for. But how would I call this from a web application? So then I tried full-text search, but this doesn't really work out as well.
Will I have to re-write the logic that the fuzzy lookup uses to enable it to work? i.e. using Full Text Indexes and FreeTextTable, ContainsTable, SoundEx and the like to somewhat even come close to what the Fuzzy Lookup has?
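Not a full answer, but as a rough feel for how far plain T-SQL gets you without Fuzzy Lookup's token-based matching: SOUNDEX and DIFFERENCE are built in and handle phonetic near-misses. A sketch (the Artist table and @search variable are made up for the example):

declare @search varchar(100)
set @search = 'Notorious BIG'

-- DIFFERENCE scores 0-4; 4 means the two SOUNDEX codes match exactly
select top 5 ArtistName
from dbo.Artist
where DIFFERENCE(ArtistName, @search) >= 3
order by DIFFERENCE(ArtistName, @search) desc

Punctuation-only variants like "Notorious B.I.G." score well here, but SOUNDEX keys off only the leading sounds of the string, so it degrades quickly on longer multi-word names; that gap is exactly what Fuzzy Lookup's token matching covers.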
I am trying to come up with a Matrix report using RS 2005 that looks like the one below. I have row data: question; column data: Ethnicity and Gender; detail data: average response by Gender and average response by Ethnicity. I utilized the Report Wizard to create a report in a Matrix format; however, I wasn't able to have the Gender and Ethnicity columns on the same line. The Report Wizard groups columns so Gender is a subset of Ethnicity or Ethnicity is a subset of Gender. How can I solve this issue so that Gender and Ethnicity data is presented in a Matrix format and Gender and Ethnicity are not part of each other?
Thank you!
F M Asian Multi-cultural Non-resident Unknown White Black Hispanic Native American
a. Worked on a paper or project that required integrating ideas or information from various sources 3.95 3.99 3.54 4.50 3.28 4.20 4.18 4.04 3.92 3.90
b. Used library resources 4.26 4.09 4.12 4.33 4.10 4.33 4.26 4.22 4.10 4.10
c. Prepared multiple drafts of a paper or assignment before turning it in 3.97 3.76 3.80 4.50 3.58 3.86 3.95 4.09 4.00 4.00
When the toggle is defaulted to Visible or Invisible, +/- displays correctly. But when using an Expression, + always shows as the default, even when a group is expanded. Is this expected? TIA.
Dear people, I have a problem with a report where users want to see transactions grouped by account number and type (e.g. incomes and payments).
So I use a table object where I put 2 groups: 1) group by Account Number: Fields!Account.Value; 2) group by Transaction Type: Fields!Type.Value. In the detail row I group by Fields!Type.Value and Fields!Account.Value.
When I execute the report it only shows the first row per transaction type, but the dataset has several rows per transaction type.
I've created, with the help of some great people, an SSIS 2005 package which does the following so far:
1) Takes an incoming txt file. Example txt file: http://www.webfound.net/split.txt
The txt file going from top to bottom is sort of grouped like this:

Header Row (designated by 'HD')
Corresponding Detail Rows for the Header Row
...
Next Header Row
Corresponding Detail Rows
...and so on
http://www.webfound.net/rows.jpg
2) Header Rows are split into one table, Maintenance Detail Rows into another, and Payment Detail Rows into a third table. A uniqueID has been created for each header and its related detail rows to form a PK/FK relationship, as there was none prior to the import; to start with, the only relation was the order of each header and the related rows below it. The reason I split this out is so I can massage it later with stored proc filters, whatever...
Now I'm trying to somehow bring the data in those tables back together, like it was initially, using a query, so that I can cut each of the Header / Detail Row sections out into its own txt file. So, if you look at the original txt file, each new header and its related detail rows (an example of a cut piece would be http://www.webfound.net/rows.jpg) need to be cut out and put into their own separate txt file.
This is where I'm stuck: how to create a query to combine it all back in an OLE DB Source component, then somehow read that source and split the sections out into their own individual txt files.
The filenames of the txt files will vary and be based on one of the column values already in the header table.
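In case it helps frame the query side, a hedged sketch (the table and column names are invented, since the post doesn't show the real schema): UNION the three tables back together tagged with a row type, so the OLE DB Source hands the rows back in original file order, grouped under their headers:

select h.UniqueID, h.RawLine, 0 as RowType
from dbo.HeaderRows h
union all
select m.UniqueID, m.RawLine, 1
from dbo.MaintenanceDetailRows m
union all
select p.UniqueID, p.RawLine, 2
from dbo.PaymentDetailRows p
order by UniqueID, RowType

Downstream, one common pattern is a ForEach loop over the distinct UniqueIDs, each iteration filtering the data flow to one header and writing to a Flat File Destination whose connection string is an expression built from the header's filename column.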
Hi, my process passes 1,000,000 rows to a data flow with about 20 lookups to get the keys that I wanted. Most lookups have a small number of rows, except one with over 5,000,000 rows. I cannot get the process to run (the process hangs), probably because of a memory issue. Any clue where/how I can tune it? Thanks
I want to do a lookup on date column. My lookup date is of type smalldatetime, and my date is of type datetime (date with time component). My lookup is failing because of incompatible data types.
How do I perform the lookup with date columns having date and time components?
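One sketch, assuming the smalldatetime side is the reference table: cast inside the Lookup's reference query so both sides compare as datetime, and strip the time from the pipeline column first so the dates match on the day:

-- Reference query for the Lookup: cast smalldatetime up to datetime
select LookupKey,
       cast(DateCol as datetime) as DateCol
from dbo.ReferenceTable

-- Upstream, a Derived Column removes the time portion from the pipeline's
-- datetime column (a standard SSIS expression idiom):
--   (DT_DBTIMESTAMP)(DT_DBDATE)MyDateColumn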
I'm having a little difficulty speeding up the process below.
I have two tables. Table A contains 300,000 rows, which are unique, although the identifier can appear more than once. The fields I'm interested in are the identifier and a date/time field.
Table B also contains an identifier and a date field, and again can contain multiple instances of the identifier with a variety of dates.
For each row in Table A I want to take the date and look up in Table B the first date greater than or equal to it, linking by the identifier.
I have managed to do this via the code below; however, it takes 45 minutes, and I want to speed this up.
Select a.*,
       (select Min(b.DateB)
        from #tableB b
        where a.identifier = b.identifier
          and b.DateB >= a.DateA) as DateB
From #TableA a
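Two things that usually help here, sketched below: an index on the lookup table so each MIN becomes an index seek rather than a scan, and (on SQL Server 2005+) the same logic written as OUTER APPLY, which the optimizer sometimes handles better than a correlated subquery in the select list:

-- Covering index for the probe
create index IX_tableB_identifier_DateB on #tableB (identifier, DateB)

select a.*, b.DateB
from #TableA a
outer apply
(
    select Min(b.DateB) as DateB
    from #tableB b
    where b.identifier = a.identifier
      and b.DateB >= a.DateA
) b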
I have a source table with a few million rows in it. As part of the transformation, I need around 10 lookups into 10+ different tables, each of them having a few million rows as well. I am looking for an approach that would be reasonably fast and easy to maintain through future changes.
Here are some of the things that I have tried: (1) If I implement them as Lookup components, they cause the developer machine to go really slow and take forever to run. (2) I tried having the OLE DB Source query fetch the required data up front, but the source query becomes very complicated, which will make future changes even harder, and the big query causes SQL Server to become unresponsive. (3) Update queries on the target table are also causing the server to be unresponsive.
What would you guys suggest for this type of implementation?
I have some SQL experience, but nothing past basic commands. I'm trying to take some data held by one application and use it as a CSV import into another application. I have two tables from the application; one holds references to values defined in the other. The first table holds details about a person:
field1 = name, field2 = age, field3 = country

Joe,50,1
Country is held as a number; there is then another table that holds all the countries:
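The post is cut off here, but assuming the second table looks something like Country(id, countryName), the CSV-ready query is a join that swaps the number for the name (table and column names below are guesses):

select p.name, p.age, c.countryName
from dbo.Person p
join dbo.Country c
  on c.id = p.country

The result set can then be saved as CSV from the query tool, or exported with bcp.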
Hi, we use Lookups to join a few huge tables in SSIS (each has more than 40 million rows). The process took almost two days to complete when we selected partial load on the Lookups, and it stops/locks if we select full load.
We have a 32-bit server, so SSIS uses only 2-3 GB of the available memory no matter how much RAM we have. It seems the best solution to my problem is to move to a 64-bit server so SSIS can use up to 16 GB of RAM.
For now I am researching a stopgap solution to get better performance from our current environment while we are waiting for the big server.
I'd like to hear your thoughts on options that may improve the performance of our package. Does partitioning help? What else could be helpful?
I have a dimension table for Retail Order Size. Each row in the dimension has a Starting Value and Ending Value column pair. In TSQL, the correct RetailOrderSize key is found by using the BETWEEN statement, like so:
SELECT RetailLevelKEY
FROM dbo.DimRetailOrderSize
WHERE @Sec1Retail BETWEEN StartingValue AND EndingValue
Is there a Data Flow Task Transformation in SSIS that replicates this functionality, or some other way of getting to the same answer in SSIS?
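Not out of the box: the Lookup's cache join is equality-only. One hedged workaround, going from memory of the 2005 UI: put the Lookup in partial-cache/no-cache mode (Advanced tab, "Modify the SQL statement"), where the per-row statement is parameterized and can use BETWEEN:

-- Custom statement for the Lookup's Advanced tab;
-- the ? maps to the pipeline's Sec1Retail column
select RetailLevelKEY
from dbo.DimRetailOrderSize
where ? between StartingValue and EndingValue

The trade-off is a round trip per uncached value; the set-based alternative is to stage the rows and resolve the key with a T-SQL range join after the data flow.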
When doing a fact table load, I have to perform lookups against the dimension tables. The dimension tables I have support slowly changing dimensions, however, and thus have multiple rows for a single legacy key under different effective start and end dates. To do this lookup, I have to not just join on the legacy key, but also validate that the date of the transaction I'm loading falls within the effective date range of the dimension row.
It seems the Lookup task only supports equijoins. Am I missing something here? How is this accomplished when you need greater-than-or-equal-to and less-than join conditions?
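For what it's worth, the range test itself is plain SQL. A sketch of the join the load needs to reproduce (the dimension and staging names are invented):

select f.LegacyKey, f.TxnDate, d.SurrogateKey
from staging.FactRows f
join dbo.DimCustomer d
  on d.LegacyKey = f.LegacyKey
 and f.TxnDate >= d.EffectiveStartDate
 and f.TxnDate <  isnull(d.EffectiveEndDate, '9999-12-31')

Inside the data flow, the usual workarounds are a parameterized no-cache Lookup query carrying the date conditions, or a Merge Join on the legacy key followed by a Conditional Split that applies the date-range test.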
I've got an ETL process I have written which takes about 10 million rows from a staging database and loads them into the production database with an INSERT statement. The INSERT statement makes a function call to retrieve the surrogate key for each row. The function looks in a replicated copy of our production database, so no load is placed on our production environment during this time.
So: INSERT INTO foo(...) SELECT name, address, zip, dbo.fnGetSurrKey(name, address)
It took about 12 hours to insert 6 million rows last night, and I'm wondering if there is a better way of doing this. Maybe a multithreaded way, like SSIS might have.
Assuming my function is optimized as much as possible, does anyone have any tips for speeding this up?
Also, the machine this ran on has 16 GB of RAM but was set up to use only 2 GB during this process. I already changed it to 12 GB and restarted the process a week ago, but the change doesn't take effect until you reboot. Would I see a significant performance increase from that?
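One thing worth trying regardless of the memory settings: a scalar UDF in the SELECT runs once per row, so 6-10 million calls tend to dominate the runtime. If the function is essentially a key lookup, the set-based equivalent is a join against the replicated key source (the names below are invented to illustrate):

insert into foo (name, address, zip, SurrKey)
select s.name, s.address, s.zip, k.SurrKey
from staging.dbo.SourceRows s
join replica.dbo.KeyMap k
  on k.name = s.name
 and k.address = s.address

An index on KeyMap(name, address) that includes SurrKey keeps the probe cheap, and the engine can then parallelize the whole statement.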
So I have three lookups in a row in my data flow. Basically they are doing data quality checks for me using a reference table.
I want to be able to take the error flows of the three lookups and merge them together (union all) so that I can insert the "errors" (or non matches) into a table.
Can't do it: because SSIS deems non-matches "errors", you automatically get the ErrorCode and ErrorColumn fields, and when you try to union one lookup's error output with another lookup's error output, it refuses.
What I would like to see is a lookup that acts more like a conditional statement, where you have three outputs: match found, no match found, and error. Either that, or I'd like to be able to edit the names of the ErrorCode and ErrorColumn fields.
Am I missing something here, or do I just need to add an OLE DB destination for each lookup error flow when I only want one? 'Course the problem then is that I want to count the number of rows that are in "error" across all of the lookups.
In many of my packages I have to translate an organizational code into a surrogate key. The method for translating is rather convoluted and involves a few lookup tables. (For example, lookup in the OrgKey table. If there is a match, use that key; if not, do a lookup of the first 5 characters in the BUKey table. If there is a match, use that key; if not, do a lookup of the first 2 characters... You get the idea.)
Since many of my packages use this same logic, I would like to consolidate it all into one custom transformation. I assume I can do this with a script transform, but then I'd lose all the caching built into the lookup transforms.
Should I just bite the bullet, and copy and paste the whole Rube Goldberg contraption of cascading lookup transforms into each package? Or is there a better solution I'm overlooking?
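If the cascade ever moves server-side, it collapses into one left-join chain with COALESCE, which is also easy to wrap in a view that every package shares. A sketch (the OrgKey and BUKey table names are from the post, the rest is invented):

select s.OrgCode,
       coalesce(ok.SurrKey, bu.SurrKey, fb.SurrKey) as OrgSurrKey
from staging.SourceRows s
left join dbo.OrgKey ok on ok.OrgCode = s.OrgCode
left join dbo.BUKey  bu on bu.Code = left(s.OrgCode, 5)
left join dbo.FallbackKey fb on fb.Code = left(s.OrgCode, 2)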
In the data flow of my package, I must check whether a row exists in one table, and if that row exists, I should get some other row from another table and update that.
I think that to check whether a row exists, I should use a Lookup.
But can't we pass parameters to a Lookup?
I am trying to use this SQL:
SELECT count(*) FROM ServicePackets where ID = ? and CHANGEDATE > ? and status = 1
I only need to get whether that row exists or not (true or false).
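One hedged rework: a Lookup joins pipeline columns to reference columns rather than taking ad hoc parameters (parameter markers only come into play in the partial/no-cache custom statement), so the existence test can be recast as an equality join plus a downstream comparison:

-- Reference query for the Lookup: join on ID, return the date to test
select ID, CHANGEDATE
from dbo.ServicePackets
where status = 1

-- In the data flow: the Lookup joins on ID; a no-match means the row
-- doesn't exist. A Conditional Split after the Lookup then tests
-- CHANGEDATE > [pipeline date] to complete the true/false check.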
OK, I have this table I am grabbing from Oracle, and I need to take selected columns and do a value lookup against another table. IE, here is a list of fields I get from Oracle:
In this example it is only 2, but in other conversions it could be 20 or more. Now here is my select statement for each field (the ? being the FIELD from before, IE Status or Responsibility):
This is for Status:
SELECT VALUE AS STATUS_VALUE
FROM Field_Values
WHERE (NAME = 'Project Name') AND (ENUMID = ?) AND (FIELDNAME = 'Status')
This is for Responsibility:
SELECT VALUE AS RESPONSIBILITY_VALUE
FROM Field_Values
WHERE (NAME = 'Project Name') AND (ENUMID = ?) AND (FIELDNAME = 'Responsibility')
So for this, the way I am doing it now is I have 2 Lookup components set up. It works fine. However, as I said, when I get to say 20 or so it gets really tiresome. I was wondering if I could feed it an Excel or XML file saying these are the fields that need a value lookup, where:
FIELD - Project - FIELDNAME VALUE - OUTPUT VALUE
So with this example I would have a file saying something like this:

STATUS - Project Name - Status - STATUS_VALUE
RESPONSIBILITY - Project Name - Responsibility - RESPONSIBILITY_VALUE
Then it runs whatever and returns the *_VALUE for each row it goes through... Any suggestions?
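One pattern that keeps the package static as fields are added, sketched under the post's schema: make the Lookup generic by keying the reference on (FIELDNAME, ENUMID), unpivot the source columns into rows, look up once, and pivot back:

-- Single reference query serving every field
SELECT FIELDNAME, ENUMID, VALUE
FROM Field_Values
WHERE NAME = 'Project Name'

-- In the data flow: an Unpivot transform turns the Status, Responsibility,
-- ... columns into (FIELDNAME, ENUMID) rows; one Lookup joins on both
-- FIELDNAME and ENUMID; a Pivot transform then restores the row shape with
-- the *_VALUE columns. Adding field 21 becomes a mapping change rather than
-- a new Lookup component.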
Why can you not turn off the caching in a Lookup against Oracle?
I have an exceedingly complicated SQL statement like this -
SELECT OBJECT_ID, OBJECT_CODE FROM OBJECT_TABLE
If I turn off the cache for a lookup I get bombarded with this rubbish:
Error 8 Validation error. DFT Load STATUS: LKP Get RESULT_NO [128]: An OLE DB error has occurred. Error code: 0x80040E14. An OLE DB record is available. Source: "Microsoft OLE DB Provider for Oracle" Hresult: 0x80040E14 Description: "ORA-00933: SQL command not properly ended ". Update.dtsx 0 0
Error 9 Validation error. DFT Load DAY_STATUS: LKP Get RESULT_NO [128]: OLE DB error occurred while loading column metadata. Check SQLCommand and SqlCommandParam properties. Update.dtsx 0 0
I have tried modifying the Cache SQL Statement as well, but to no avail. I am using the MSDAORA.1 provider against "Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production".
I used Lookups on the DIM tables to get my surrogate keys (SUKs), and when I use a Union transformation to combine the output from each Lookup and then load the data with some condition, the data in my fact table does not load in the proper format.
The Union transformation is splitting the output into different records.
Please let me know which transformation should be used to get the data from the lookup tables,
or let me know the general approach to loading the fact table in SSIS.
I'm basically an INFORMATICA resource, so I think of the implementation in INFORMATICA terms.
Our existing DW's ETL was written in a very complex fashion by the previous team. They use DTS package lookups to read a row in the source SQL Server database and see if that row exists in the target SQL Server database. If the row does not exist, they use ActiveX scripts to INSERT the row in the target database; if it exists, they update the row on the target side. How would you do this in SSIS? Apologies if this sounds like a basic question; however, I would have done this via stored procedures or SQL scripts, especially since it involves SQL Servers alone. Appreciate any help.
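Two common shapes for this, sketched with invented table and column names: in SSIS, a Lookup against the target with non-matches redirected to an OLE DB Destination (inserts) and matches feeding an OLE DB Command (updates); or, since both ends are SQL Server, a pair of set-based statements run from an Execute SQL Task:

-- Update rows that already exist in the target
UPDATE t
SET    t.Col1 = s.Col1, t.Col2 = s.Col2
FROM   TargetDB.dbo.MyTable t
JOIN   SourceDB.dbo.MyTable s ON s.BusinessKey = t.BusinessKey

-- Insert rows that don't exist yet
INSERT INTO TargetDB.dbo.MyTable (BusinessKey, Col1, Col2)
SELECT s.BusinessKey, s.Col1, s.Col2
FROM   SourceDB.dbo.MyTable s
LEFT JOIN TargetDB.dbo.MyTable t ON t.BusinessKey = s.BusinessKey
WHERE  t.BusinessKey IS NULL

The set-based version is typically far faster than the row-by-row DTS approach it replaces.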
I notice that SQL Server 2005 creates worktables where SQL 2000 does not. Often these work tables appear in STATISTICS IO, but they show a 0 scan count and 0 logical reads. These worktables often appear to be substituted for bookmark lookups.
Has the optimizer decided to use worktables instead of bookmark lookups (often resulting in a higher cost plan)?