When designing an ETL solution for a data warehouse, it is very important to make sure the ETL process is optimized. We usually Extract data from one or more data sources, Transform the data by applying data transformations, and then Load the data into the data warehouse. This ETL approach works fine in most cases, but in certain scenarios it does not work well. Let's look at how SSIS transformations work and how we can optimize them.
In SSIS, the data flow engine uses buffers to load data into memory and then transforms the in-memory data sets. Data flow transformations can be categorized into three types:
Row Transformations – These transformations process data on a record-by-record basis: each input row produces an output row in the same buffer. Examples of row transformations are Data Conversion, Derived Column and Lookup. Row transformations are also called synchronous transformations.
Partially Blocking Transformations – These transformations typically have multiple inputs and may hold rows in memory until they can be matched or combined, so output can lag behind input without waiting for the entire data set. They are asynchronous transformations: their output is copied into a new buffer and may be processed on a new thread. Examples include Merge, Merge Join and Union All.
Blocking Transformations – These transformations must read all input rows before they can produce any output. They are also asynchronous in nature. Examples include Aggregate and Sort. For a blocking transformation, SSIS allocates a new buffer and a new execution thread for its output data.
To optimize your ETL process, sort or aggregate data in the source query whenever possible: an ORDER BY or GROUP BY executed by the database engine is usually much faster than the SSIS Sort or Aggregate transformation. Likewise, you can perform data conversions and compute derived columns in the SELECT list, and instead of Lookup transformations you can join the lookup table(s) directly in the source query.
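As a sketch of this idea, a single source query can stand in for Sort, Data Conversion, Derived Column and Lookup transformations all at once. The table and column names below are illustrative only, not from a real schema:

```sql
-- Hypothetical source query for an OLE DB Source component.
-- Table and column names are illustrative only.
SELECT
    o.OrderID,
    CAST(o.OrderDate AS date)    AS OrderDate,   -- replaces Data Conversion
    o.Quantity * o.UnitPrice     AS LineTotal,   -- replaces Derived Column
    c.CustomerKey                                -- replaces Lookup
FROM dbo.Orders AS o
JOIN dbo.DimCustomer AS c
    ON c.CustomerID = o.CustomerID
ORDER BY o.OrderID;                              -- replaces Sort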
When you have a huge number of records, I think it is best to first load all records into a staging table using a data flow component. Once the records are loaded, use one or more Execute SQL Tasks to perform the required Insert/Update/Delete or any other complex operation using a T-SQL query or stored procedure. This technique is called ELT (Extract, Load and then Transform): you first load all the records and then transform the data. In ELT, the source and destination of the transformation step are the same database. You may apply the ELT technique in either the staging database or the data warehouse: first load the data into staging or the data warehouse, then use SQL statements or stored procedures to do the transformations. In ELT, the main processing is done by the database engine, which takes advantage of query execution plans, indexes and partitions and works very well with large data sets.
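The transform step of this ELT pattern is often a single set-based statement run from an Execute SQL Task after the staging load completes. As a minimal sketch, with hypothetical table names, an upsert from a staging table into a warehouse dimension might look like this:

```sql
-- Hypothetical transform step run from an Execute SQL Task after
-- the data flow has filled stg.Product. Names are illustrative only.
MERGE dbo.DimProduct AS tgt
USING stg.Product    AS src
    ON tgt.ProductID = src.ProductID
WHEN MATCHED AND (tgt.ProductName <> src.ProductName
               OR tgt.ListPrice   <> src.ListPrice) THEN
    UPDATE SET tgt.ProductName = src.ProductName,
               tgt.ListPrice   = src.ListPrice
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, ProductName, ListPrice)
    VALUES (src.ProductID, src.ProductName, src.ListPrice);
```

Because this is one set-based statement, the database engine can use its execution plan, indexes and parallelism, instead of pushing each row through in-memory SSIS transformations.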
I have been using ELT in my data integration projects and have found it to be faster and more efficient than regular ETL. Depending on your scenario, the ELT approach may lead you to a hybrid strategy such as ELTL or ETLT.