Requirement:
I have a simple package with one dataflow task. In it I need to read from a SQL table and, for every row in that table, loop n times and generate new output rows based on certain conditions (which are best evaluated in a custom script, as they are rather complex). Hence, if I have 100 rows in the table as my input, I may end up with 100*n rows as output.
My Design:
To implement this I have used an OLE DB Source which outputs to a Script Transform (ST). In the ST I intend to loop through in custom code and generate new rows using the .AddRow feature when I need new rows. This ST then feeds into another OLE DB Destination which writes the data to the table. Simple!
I am using the default buffer settings. All I have tweaked is the Synchronous... property on the script transform (otherwise I do not get to the Output0Buffer within the script!).
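For illustration, the one-to-many expansion described in this design can be sketched in plain Python. This is not SSIS Script Component code: `expand_rows`, `keep`, and the `iteration` column are hypothetical names, and `keep` only stands in for whatever complex conditions the custom script would evaluate.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]


def expand_rows(
    rows: Iterable[Row],
    n: int,
    keep: Callable[[Row, int], bool],
) -> Iterator[Row]:
    """Yield up to n output rows for every input row.

    `keep` is a hypothetical stand-in for the conditions that decide
    whether iteration i of a given input row produces an output row.
    """
    for row in rows:
        for i in range(n):
            if keep(row, i):
                out = dict(row)        # copy so the input row is left untouched
                out["iteration"] = i   # hypothetical extra output column
                yield out
```

Calling `list(expand_rows(source_rows, 3, lambda row, i: True))` on two input rows would give six output rows. In this sketch each input row is expanded independently of the others.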
Problem:
I wish to do as much as possible in parallel. So I would expect the OLE DB Source to provide more than one row at a time to the script transform, which should then process more than one input row simultaneously. Instead, the script component seems to be serializing its input: it appears to take one row at a time from the OLE DB Source, loop through it, and process it in the script transform.
AM I RIGHT IN THINKING THAT THE SCRIPT TRANSFORM IS EXECUTING THE INPUT IN A SEQUENTIAL MANNER?
CAN I PARALLELISE THIS?
If so, how?
I would just like to confirm something with you guys...
Am I correct in saying that you don't need multiple connections to the same DB in an SSIS package in order to achieve parallel processing across multiple SQL tasks? In other words, I have 2 SQL tasks executing different stored procedures on the same DB that I want to run in parallel. They should be able to share one connection and still process in parallel, correct?
With that in mind, would the processing be faster if they each had their own connection?
I have a very simple SSIS package that is moving data from a DB2 database to a Teradata box. I've run it around 10 times, twice it pushed data over, the balance of the time, it executes with no error, but moves nothing over. In the "incomplete" runs, a command line box pops up for half a second, then the package ends.
Does anyone have ideas as to why this behavior is occurring?
I am trying to write an SSIS surrogate key data transform. My problem is that I can't find an example of how to add a column to the incoming columns and add some data to it. If anyone has a sample, can you please post it? I found a script option that works, but I would like an actual transform.
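As a rough illustration of the logic only (not an SSIS custom transform), adding a surrogate key column to incoming rows can be sketched in Python. The function name, the `surrogate_key` column name, and the `start` value are all hypothetical; in practice `start` would presumably be seeded from the highest key already in the destination.

```python
from itertools import count
from typing import Dict, Iterable, Iterator

Row = Dict[str, object]


def add_surrogate_key(
    rows: Iterable[Row],
    start: int = 1,
    column: str = "surrogate_key",
) -> Iterator[Row]:
    """Append an incrementing integer key column to each incoming row."""
    next_key = count(start)            # start, start + 1, start + 2, ...
    for row in rows:
        out = dict(row)                # keep all incoming columns
        out[column] = next(next_key)   # add the new column with its value
        yield out
```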
I have an SSIS package with several data flows. I need to do some complex data evaluations, so I have used a script as a transform in two of the DFTs. If I run these separately, everything works great and there are no problems whatsoever. If I run them together, I noticed I was getting an error on the second one. I discovered that this seems to be some kind of namespace problem, since both scripts were using the Input_0 buffer. So I changed the name of the second one and retested.
Well, I no longer get the error, and in fact it seems to run through the entire SSIS package just fine. However, when I look closer I notice that the second script task just does not seem to do anything at all. The script task does a lot of evaluation of the incoming data and then does some calculations depending on the value in the service code. However, when it runs through this in the second script task, all of the defined output rows are just empty.
I have gone through and made sure that all input and output buffers have unique names, thinking this was a similar problem, but no luck. I even changed all column and variable names to be unique, with no luck. Again, if I run them separately everything works fine; it is only when I run the entire package that this problem occurs.
I have an OLE DB Source and I want to convert the data types of some fields in the table before I export it to an OLE DB Destination. Is there a way to convert a numeric value to float, and numeric to nvarchar?
I am trying to read in a flat file, transform the fields and store into a destination database.
In DTS, this works using Transform Data Task Properties. I define the columns and then have a VB script on the Transformations tab that changes any bad data.
Is there a way to do this in SSIS that I can define the column transformations and re-use my VB scripts?
I have a Pivot Transform in SSIS (2005) working perfectly, EXCEPT that the first column of the output (the date) repeats for each of the following columns. Those values fall into the correct columns, but not on the same line as the others for a particular date. Snippet of result from Data Viewer is:
I have too many DTS packages to migrate to SSIS. While examining a DTS package in BIDS (converted with the migration utility), I tried to edit the resulting migrated package, which opened the DTS interface with the two connection icons joined by the big fat arrow with a gear on it... not exactly what I had in mind. In other words, it looks like SSIS on the outside, but it's still DTS on the inside. So I stripped out a series of components from a more complex package, hoping that simplifying it would reveal the contents of the old DTS Transformations tab at least partially set up in a Derived Column transformation. Can I get there from here, or must I recreate every stinking definition in a Derived Column manually from the ground up? Thanks very much for your help.
I can't figure out how to put nested tables into the Data Mining Model Training Transform (SSIS). I can do a simple case table, but how do you get those nested tables with DM Training Transformation? Any ideas? Samples?
I work in the healthcare area, and am handling the survey data ETL's. There are around 8 different survey areas and based on information received from them for the visit they reference, I want to pull in more info from our invoicing database. My idea is this:
1.) Pull in the flat file to an ODBC staging table.
2.) Cache all invoice records that fall between the MIN(Date of Service) and MAX(Date of Service) from the staging table.
3.) First look up the information needed on patientID, providerID, date of service, and billing location.
4.) For the surveys that didn't match on those 4 columns, try looking up based on patientID, date of service, and billing location (since I could be 99% sure this would still return the record I need).
5.) For the remaining surveys, look up based just on patientID and date of service. These records will be flagged for manual review because clearly, if a patient has multiple appointments in the same day, this will be prone to error.
However, in trying to use only 3 of the columns in the lookup, I get the error saying basically that I need to utilize all 4. Is there a way around this, or is there an entirely different way I should be approaching this? The reason I thought cache transform was the answer is because I will need to run a different package for each lookup, as the data and logic between each survey will vary, but the invoice data "pool" will stay the same regardless.
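The fallback matching described in the steps above can be sketched in Python, purely to illustrate the logic outside SSIS. It assumes one in-memory index per key combination, all built from the same cached invoice pool; the column names, `KEY_LEVELS`, and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]

# Hypothetical column names, most specific key set first.
KEY_LEVELS: List[Tuple[str, ...]] = [
    ("patient_id", "provider_id", "service_date", "billing_location"),
    ("patient_id", "service_date", "billing_location"),
    ("patient_id", "service_date"),
]


def build_indexes(invoices: Sequence[Row]) -> List[Dict[tuple, Row]]:
    """Build one dictionary per key level from the same invoice pool."""
    indexes = []
    for keys in KEY_LEVELS:
        index: Dict[tuple, Row] = {}
        for invoice in invoices:
            # setdefault keeps the first invoice seen for a given key
            index.setdefault(tuple(invoice[k] for k in keys), invoice)
        indexes.append(index)
    return indexes


def match_survey(
    survey: Row, indexes: List[Dict[tuple, Row]]
) -> Tuple[Optional[Row], Optional[int]]:
    """Return (invoice, level) for the first key level that matches."""
    for level, (keys, index) in enumerate(zip(KEY_LEVELS, indexes)):
        hit = index.get(tuple(survey[k] for k in keys))
        if hit is not None:
            return hit, level
    return None, None
```

A match at level 2 (patient and date only) would be the one flagged for manual review; `(None, None)` means no invoice was found at any level.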
In my current project I have a requirement to assign the value of an aggregate transform to a variable, but I need to accomplish it without using a script task.
I would like to know what happens when a very large reference data set for a lookup transform with full caching enabled is being loaded during package execution and the computer's memory runs out or is very low. Does SSIS a) give an out-of-memory error of some sort, b) resort to a no-caching or partial-caching mode, or c) maintain the full caching mode but switch to using the paging file (virtual memory)?
I think it will resort to using the page file, in which case the benefits of in-memory lookups are lost and performance would suffer. If I cannot upgrade the memory or shrink the reference set somehow, I should switch that lookup task to use partial caching or no caching with an indexed lookup table. Would this make sense?
We are using a lookup transformation in SSIS 2012. The lookup transformation queries a table with two date columns. When we hover the mouse over the two columns in the 'columns' tab of the lookup transformation editor, they show as DT_WSTR instead of DT_DBDATE. This causes the SSIS package to fail due to a data type mismatch. A similar abandoned thread is available at: URL....
Hi JayH (or anyone). Another week... a new set of problems. I obviously need to learn .NET syntax, but because of project deadlines in converting from DTS to SSIS it is hard for me to stop and do that. So, if someone could help me with some easy syntax, I would really appreciate it.
In DTS, there was a VBScript that copied a set of flat files from one directory to an archive directory after modifying the file name. In SSIS, the directory and archive directory will be specified in the config file. So, I need a .NET script that retrieves a file, renames it, and copies it to a different directory.
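The steps themselves (find each file, build the modified name, copy it to the archive directory) can be outlined in Python as a logic sketch; this is not the .NET script being asked for. The post does not say how the name is modified, so the `_archived` suffix and the function name are hypothetical, and the two directories are plain arguments here rather than values read from a config file.

```python
import shutil
from pathlib import Path
from typing import List


def archive_files(
    source_dir: str, archive_dir: str, suffix: str = "_archived"
) -> List[Path]:
    """Copy every file in source_dir to archive_dir under a modified name."""
    src = Path(source_dir)
    dst = Path(archive_dir)
    dst.mkdir(parents=True, exist_ok=True)   # create the archive dir if missing
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            # e.g. data.txt -> data_archived.txt (hypothetical naming rule)
            target = dst / f"{path.stem}{suffix}{path.suffix}"
            shutil.copy2(path, target)       # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

The original files stay in place, matching the copy behavior described for the old VBScript.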
I know you can change the max degree of parallelism server-wide, but can you do it on the fly for one query? I know... trust the query processor, but when I turn it off for this one SP, my query goes from 3 seconds to 0, and I've got this ex-MS guy in here telling me there is a way, but he does not remember how.
I want him to simplify the sp or have his project's DBA do it, and I even offered to take a hack but.... you know.
Does anyone know about SQL Server's parallelism? A query without parallelism takes much less time than the same one with parallelism; in my case it's 6 times faster without parallelism. If that's true, what do we need parallelism for? Any ideas? Thanks
I have a function that returns a table of information about residential properties. The main input is a property type and a location in grid coordinates. Because I want to get only a certain number of properties, ordered by distance from the location, I get the properties from a cursor ordered by distance, and stop when the number is reached. (Not really possible to determine the distance analytically in advance.) The cursor also involves joins to a table of grid coordinates vs. postcodes (the properties are identified mainly by postcode), and to a table that maps the input property type into what types to search for. Opening the cursor typically results in the creation of six to eight parallel threads, and takes approx 1 second, which is about half of the total time for the function.
Recently the main property table grew from 4 million to 6.5 million records, and suddenly the parallelism is lost. Taking the identical code and executing it as a script gives parallelism. Turning it into a SP that inserts into a #temp table and then selects * from that table as the last statement also gives parallelism. But when it's in the form of a function, there is only one thread -- and the execution time has gone from ~2 sec to ~8 sec. I updated the statistics on the table, but still no parallelism.
I could turn it into a SP easily enough, but that would involve a change to the C++ program that calls it, which takes a while to get through the pipeline. In the meantime, is there some way to induce the optimizer to use parallelism? It used to.
Hi, I've set 'max degree of parallelism' to 1 because some SQL requests hung. Now when I connect, how can I set the parallelism to 4 for a session? Is there a command like this: 'alter session set max degree of parallelism 4'?
Thanks,
Paul
If SQL Server is designed for multi-processor systems, how can running a query in parallel make such a dramatic difference to performance?
We have a reasonably simple query which brings in data from a few non-complex views. If we run it on our 2x2.4GHz Xeon server it takes 6 minutes plus to run. If we run this on the same server with OPTION(MAXDOP 1) at the end of the same query, it takes less than a second.
Examining the execution plan, the only difference I have been able to see is that parallelism is taking up 96% of the run time when using two processors. This drops when using the one, so a sort takes up the vast majority of the time for the query to run.
OK, so running in parallel should mean that it's run in various parts and then 'joined up' later for performance gains, but how can it get it so wrong (timewise)?
If this is the case, will I see a significant difference changing our server to use a single processor, which seems completely the wrong approach (or should I do this on each query in each app - eek)?
Do we have a problem that we don't know about that causes it to take this long?
What can we do? Ideally, using both processors would seem to be preferable.
Microsoft SQL Server 2008 R2 (SP2) - 10.50.4000.0 (X64) Jun 28 2012 08:36:30 Copyright (c) Microsoft Corporation Express Edition with Advanced Services (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) (Hypervisor)
This is just a UAT server, with the OS and hardware details below:
OS: Windows Server 2008 R2 Standard
SP: SP1
Processor: Intel(R) Xeon(R) CPU X5650 @2.67GHz 2.66 GHz
RAM: 4 GB
Bit: 64 bit
I want to set a value for max degree of parallelism. What value should I configure?
Below is the snap property of SQL instance >> Processor
We're experiencing a large number of deadlocks since we began running SQL Server 2000 Enterprise Edition SP3 on a Dell 6650 with hyper-threading Intel processors. We don't have the same problem on Dell 6650's w/o the hyper-threading. If I turn off the parallel query processing option the deadlocks stop. I've tried all of the suggestions from the Microsoft Knowledge Base under the following link - http://support.microsoft.com/?kbid=837983
The only suggestion that actually yielded results was turning off parallel query processing, but I don't want to give up what should be a performance advantage if it wasn't for the deadlocks. Query tuning and index tuning haven't helped. Any suggestions? I haven't applied SP4 yet. I'm wondering if anyone has seen the same problem resolved with SP4.
Hi,
I have a SQL 2000 server with 8 processors; server settings are as default. I read on Technet that it is good practise to remove the highest no. processors from being used for parallelism, corresponding to the no. of NICs in the server. One of our 3rd party developers has recommended only allowing one processor to be used, as there is a performance hit by the server working out which processor to use. Does anyone have a definitive answer to this? I suspect he's wrong but I'd like some hard evidence if possible, thanks.
Kev.
Is it possible to achieve partition parallelism in SSIS? What I am asking is, In DataStage, if I load some data like 'data reader -> trans1 -> trans2 -> destination' (and assume that I have 4 nodes configured), the tool divides the data into 4 different datasets and executes the package as 4 instances. This way the data load is very fast. Is it possible in SSIS?
Of course, we can divide the dataset and load the pieces thru multiple instances. But then the division of the dataset will differ for every load, so we would need to modify the package every time. Even if we divide the dataset, I am not sure whether the 4 instances will run on 4 different nodes or on the same node. Does anybody have any idea about it?
In my package I have a source, a script component to make some changes to the data, and a destination. To speed up the process, within a data flow, I have created 6 copies of the above components and am running them in parallel. Each source takes a different set of data. I have divided the data using the record number such that each set will read 1 million records.
Now, my question is: though each pipeline is supposed to process exactly 1 million records, they are not running at the same speed. For example, 1 pipeline completes processing all 1 million records, whereas another pipeline has processed only 250000 records in that time. I don't see any reason why one should run slowly while another runs fast, considering that both are doing the same thing.
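The split by record number described in this post can be illustrated with a small Python sketch. It assumes records are numbered contiguously from 1, which the post does not state; `partition_ranges` is a hypothetical helper, not something provided by SSIS.

```python
from typing import List, Tuple


def partition_ranges(total_records: int, partitions: int) -> List[Tuple[int, int]]:
    """Split record numbers 1..total_records into contiguous, near-equal ranges."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    base, extra = divmod(total_records, partitions)
    ranges = []
    start = 1
    for i in range(partitions):
        size = base + (1 if i < extra else 0)   # spread any remainder evenly
        if size == 0:
            continue                            # more partitions than records
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

With 6 million records and 6 partitions, each range covers exactly 1 million record numbers, so each source could filter on its own inclusive `(first, last)` pair.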
Currently a server has a parallelism of 4. I would like to set the parallelism for a specific user to 2 without changing the code of the user's application.
Is this possible?
As far as I understand, with plan guides you just provide SQL statements. Do I need to find all queries from the user and add plan guides for all of them, or is there a more elegant way to do it?
I have written ETL software that runs on SQL Server. We are running it for the first time on a 4-CPU (2 x dual core) machine on SQL Server 2005.
One of the things this software does is perform a 'select * from tablename' to validate that the tables passed to it as parameters exist. This has worked fine on previous releases and on single cpu machines because what the optimiser decides to do is to return just the first page of data and then fetch more. I guess it even works in 2005 standard edition.
However, 2005 enterprise edition allows parallelism. And what the optimiser is deciding to do with such a query is to parallelise it and fetch all rows and then give the result back to the program. So, instead of seeing a fraction of a second to return the first page of data we are seeing up to 90 seconds and the database goes and fetches 15M rows in parallel.
Obviously, what we would like to do is to somehow tell the optimiser that this set of programs should not perform any parallel queries. Or, we would like to turn parallelism off on the specific tables we are dealing with for the period of running these ETL programs....they have no need of parallel processing at the database level for virtually all the calls that are performed.
Would someone please be so kind as to advise us if we can do something like pass a parameter to ODBC to stop parallelism or if we can issue commands against specific tables to stop parallelism for a period and then turn it back on?
I'm currently looking at refactoring an existing, large SSIS 2012 implementation that consists of about 55 projects and 360+ packages. The ETL framework that is in use has a "main" control package that reads from a database table and determines which packages are ready to execute (based on some dependency logic) and then uses an Execute Process task within a loop that calls dtexec with the arguments: /C start Dtexec /SQL "Some Package Path" /SERVER "someserver"
This design allows the loop to execute a package and then immediately iterate, because it doesn't wait for the package to respond (i.e. complete with a failure or success), so it can quickly kick off as many packages as are ready to execute. A SQL Agent job is used to call this package every few minutes so that it can pick up any packages that have had their dependencies satisfied since the last execution and kick those off.
It's a clever design but has some problems, such as decentralized exception handling (since the parent package is unaware of what is happening in the "asynchronous" dtexec calls).
My biggest concern is that by executing packages, not with the Execute Package Task but with the Execute Process Task, and spinning up many dtexecs, the framework is not leveraging SSIS's ability to handle threading, memory consumption, etc. across all running packages and executables, because it is simply unaware of them. It's essentially like using an Execute Package Task with the ExecuteOutOfProcess property set to true.
I have got a question on max degree of parallelism and CPU cores.
If max degree of parallelism = 1, this signifies that SQL Server will use a serial execution plan (unless you change it at the query level with the MAXDOP hint). In a serial plan, will the query use all CPU cores (say my server has 16 cores)?
If only one thread works in a serial execution plan, then what are the other threads doing? Are they idle? (I may have max server worker threads defined as 32767, the default.)
I am unable to see the relationship between these parameters.
Referencing an article regarding MAXDOP and cost threshold for parallelism from Brent Ozar's website: [URL] .....
We have 2 physical CPUs with 4 cores each and hyper-threading enabled. When looking through the Task Manager, under the Performance tab, I see 16 CPU threads. We have set the MAXDOP value to 4.
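The 16 threads shown in Task Manager follow from simple arithmetic, assuming hyper-threading exposes 2 logical processors per core. A minimal Python check (the function name is hypothetical):

```python
def logical_processors(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical processors the OS sees: sockets x cores x hardware threads per core."""
    return sockets * cores_per_socket * threads_per_core


# 2 physical CPUs, 4 cores each, hyper-threading assumed to give 2 threads per core
total = logical_processors(sockets=2, cores_per_socket=4, threads_per_core=2)
print(total)  # 16
```

With MAXDOP set to 4, a single parallel query plan is limited to 4 of those 16 logical processors.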
Reading further, cost threshold for parallelism setting is recommended at 50 to start with.