Doing ETL For The Sake Of ETL?
Apr 11, 2008

I often run into this situation where I need to load data into a data warehouse or data mart, and many times the data needs to be summarized or aggregated first, unioned/joined/merged with other data, pivoted, or given derived columns (e.g. % of totals) before arriving at the destination.
Because SSIS has all of these capabilities, I've developed packages that take each source table untouched and perform all the transformations within the package itself. But if all the source data comes from the same connection and source (often SQL Server), I can just as easily write a SQL query that uses the full power of T-SQL to group, sort, pivot, use CTEs, derive columns, etc., and take the result directly to the destination.
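For example, a single statement along these lines (a minimal sketch with made-up table and column names, not any particular schema) can aggregate, derive a percent-of-total column, and load the destination in one pass:

    -- Hypothetical example: aggregate sales by region, derive a % of total
    -- via a CTE and a windowed SUM, and insert straight into the destination.
    ;WITH RegionTotals AS
    (
        SELECT RegionID,
               SUM(SaleAmount) AS RegionSales
        FROM   dbo.FactSales
        GROUP BY RegionID
    )
    INSERT INTO dbo.SalesSummary (RegionID, RegionSales, PctOfTotal)
    SELECT RegionID,
           RegionSales,
           RegionSales * 100.0 / SUM(RegionSales) OVER () AS PctOfTotal
    FROM   RegionTotals;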
The advantage of the first method is that you can easily migrate from one source type (e.g. relational DB) to another (e.g. flat file), or deal with sources that do not have a SQL language to harness.
The disadvantage of the first method is that it is more complex to develop and maintain: there are more points of failure, and therefore more logging, error handling, and debugging. Also, if the source systems change, there is no guarantee the data will remain consistent, and you will need to modify the package anyway.
The advantage of the second method is speed. You have complete control over how the query is optimized against the relational source, and set-based operations in the engine will almost always outperform the equivalent in-package transformations because of the pipeline overhead involved.
The disadvantage of the second method is portability. If you change sources, you will have to rewrite the query, or abandon it altogether and build everything within the package anyway.
So my question is this: as a best practice and general rule, should you always develop the ETL within the package and never use SQL statements, even if performance suffers and manageability is impacted? Or, when you know the source system won't change (at least for the foreseeable future), should you harness the power of the relational engine and speed up (and simplify) your packages by using SQL queries?
I am posing this as a hypothetical question, and I know the answer is "it depends", but I am currently working with a client whose packages seem very convoluted (a typical dataflow: source-aggregate-sort-merge-source-sort-merge-lookup-union-sort-merge-source-sort-merge-lookup-lookup-lookup-destination) even though the data volume is small (fewer than 100,000 rows) and all the tables are in the same database. A simple SQL statement with a few joins, subselects, and GROUP BYs would accomplish the same thing and shave two minutes off the process. Also, when they modify the package, it inevitably fails the first few times until the bugs are worked out, which may take a couple of hours; the SQL statement can be run over and over again in SSMS until it returns the desired results, all within a matter of minutes, making the development process much more efficient.
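For instance, something along these lines (purely illustrative table names, not the client's actual schema) could replace most of that dataflow with a single source query feeding the destination directly:

    -- Purely illustrative: the joins, subselect, and GROUP BY that the
    -- multi-step dataflow is effectively doing one component at a time.
    SELECT c.CustomerID,
           c.Region,
           SUM(o.OrderAmount) AS TotalOrders,
           (SELECT COUNT(*)
            FROM   dbo.Returns r
            WHERE  r.CustomerID = c.CustomerID) AS ReturnCount
    FROM   dbo.Customers c
    JOIN   dbo.Orders o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.Region;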
So which is the better practice?
(Sidenote: as a general practice, I typically stage all data first, pulling from heterogeneous sources into a SQL Server staging area so that all data is then sourced from SQL Server. This eliminates the "what happens if the source type changes?" argument; the sources do change, but I handle that in my staging environment.)
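In practice that staging step can be as simple as a truncate-and-reload of one staging table per source (the names below are made up, and the copy could just as easily be a bare SSIS dataflow for flat files or other non-relational sources), after which every downstream query runs against SQL Server:

    -- Hypothetical staging pattern: land each source untouched in its own
    -- staging table, then do all downstream transformation in plain T-SQL.
    TRUNCATE TABLE stg.Customers;

    INSERT INTO stg.Customers (CustomerID, CustomerName, Region)
    SELECT CustomerID, CustomerName, Region
    FROM   SourceServer.SourceDB.dbo.Customers;  -- e.g. via a linked server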