DTS in SQL Server 2000 provided great capabilities to move data from one place to another or from one format to another. In fact, many data-processing architectures built on DTS adopted this model. In moving from DTS to Integration Services, however, should you continue to use this model? Using the DTS-based architecture as a comparative model, you can explore new ways of thinking about data integration.
That Was Then: DTS
You may be familiar with SSIS's predecessor, SQL Server 2000 Data Transformation Services (DTS). If not, DTS is a widely used tool designed to move data from one source to another, and many organizations have developed compelling BI ETL solutions using DTS. The differences between SSIS and DTS are many, though. In fact, SSIS is not a new version of DTS; it is a new tool in the SQL Server 2005 platform that replaces and extends DTS's functionality.
Does Figure 1 look familiar? If you've developed any DTS-based solutions to handle data processing, then surely you recognize it.
Since the topic in this article is Integration Services, the analysis of DTS will not be exhaustive; however, you will consider a few aspects of a typical DTS architecture in order to compare it with an Integration Services architecture. Ideally, the design of your ETL or data-processing solution will not be driven by your choice of ETL tool; however, a tool's capabilities inevitably influence the design. Source-to-destination data mappings are still the same and destination data structures do not change, but the process and logic to move and transform data from one to the other can be dependent on the capabilities of the tool. This was the case with DTS (in fact, this applies to any tool that is selected, but the difference is how much flexibility the tool will allow in data transformation design). The following sections consider the common design of a DTS package such as the one shown in Figure 1.
DTS Solution Architecture
DTS primarily uses a Data Pump task that moves data from a single source to a single destination with some interim transformation logic. The DTS interface focuses on the workflow features of the product, with the Data Pump being an object with configurable properties. A few built-in transformations can be defined, such as the built-in character mapping transforms. If transformation logic is included in the Data Pump, it is done in the form of a script. Some limitations are obvious, such as the one-to-one source-to-destination cardinality and the limitation of out-of-the-box transformations.
While scripting transforms can provide broad capabilities, scripting in itself introduces complexity, embedded code, and componentization scalability and support challenges. Because of these limitations, the approach that many DTS-based solutions use is to leverage the relational engine and use staging databases as interim layers of ETL. For example, if you needed the data source to split to multiple destinations, DTS would need to either make multiple passes from the source or stage the data and handle the partitioning of the data through another set of data pumps. Staging in DTS designs is also often used for cleansing, data correlation, and dimension lookups, again leveraging the relational engine. The pattern for data processing in DTS starts with a source. Then, raw data is staged. This data is transformed to an interim staging environment where lookups, data comparisons, and cleansing logic are applied. The final step in many cases loads the data to the destination, where perhaps a final cleanup may be applied.
Common DTS Processing Practices
The approach highlighted in Figure 1 and described previously contains two general features of this data transformation model. Figure 2 highlights the staging-intensive nature of the process.
In this example, two staging tables have been used in between the source and destination in order to prepare the data for the load. Furthermore, when drilling into the data pumps that move the data from one table to another, oftentimes a SQL statement does some cleansing and correlating as the data is picked up from one staging table to another. The following code shows the source query used in Step 3 of the package that loads the header table.
SELECT STGOnlineDailyOrderBulk.OnlineCustomerID , STGOnlineDailyOrderBulk.OrderDate , SUM (UnitPrice) AS SubTotal , SUM (UnitPrice)* MAX(ISNULL(TaxRate,0.08)) AS TaxAmt , ISNULL(MAX(Address.AddressID), 1) AS BillToAddressID , ISNULL(MAX(Address.AddressID), 1) AS ShipToAddressID , SUM(freight) AS freight , DATEADD(D,7,OrderDate) AS DueDate FROM dbo.STGOnlineDailyOrderBulk INNER JOIN dbo.STGOnlineCustomers ON STGOnlineDailyOrderBulk.OnlineCustomerID = STGOnlineCustomers.OnlineCustomerID LEFT OUTER JOIN dbo.TaxRate ON STGOnlineCustomers.PostalCode = TaxRate.PostalCode LEFT OUTER JOIN AdventureWorks.Person.Address Address ON STGOnlineCustomers.AddressLine1 = Address.AddressLine1 AND STGOnlineCustomers.PostalCode = Address.PostalCode AND STGOnlineCustomers.City = Address.City GROUP BY STGOnlineDailyOrderBulk.OnlineCustomerID , STGOnlineDailyOrderBulk.OrderDate
In this particular example, the transformation and data association is handled by an intense query on the relational staging database. Note the use of multiple joins between the staging and the destination tables and SQL logic to perform some in-process cleansing.
The second feature of this approach is the in-place set-based transformation using the RDBMS engine to update data within the staging or destination tables. Figure 3 highlights the DTS Execute SQL task that updates one of the staging tables in place to capture a primary key on the destination table.
In Step 5, after the header table is loaded, the source SQL statement for the Data Pump joins the details staging table with the destination header table to return the
IDENTITY column in the header.
SELECT SalesOrderHeader.SalesOrderID, STGOnlineDailyOrderBulk.* FROM dbo.STGOnlineDailyOrderBulk INNER JOIN AdventureWorks.Sales.SalesOrderHeader SalesOrderHeader ON SalesOrderHeader.CustomerID = STGOnlineDailyOrderBulk.OnlineCustomerID AND SalesOrderHeader.OrderDate = STGOnlineDailyOrderBulk.OrderDate
This example clearly does not represent all of the DTS-based solutions, nor does it leverage the lookup functionality embedded in the DTS Data Pump. The purpose of this example is to highlight the features of a typical DTS solution in order to compare it with an Integration Services solution.
As mentioned earlier, this architecture has a few limitations that affect scalability and may have an unnecessary impact on the destination systems. Generally, the DTS approach can be thought of as "ELT" instead of ETL. In other words, the common sequence for data processing is Extraction, Transformation, and Loading (ETL), but DTS solutions typically switch the order of the "T" and the "L." In this model, loading is commonly performed before transformation logic is applied. A few other limitations can be articulated from this generalized description.
One drawback to this approach is its synchronous nature. Some parallelization is used, but most data-processing steps require the full completion of the previous step in the workflow. Since this process is chained together, the total processing time will be a multiple of the duration of the extraction process, especially in a high-volume performance-driven environment. The Gantt chart in Figure 4 illustrates the total time taken to process the DTS package in Figure 1.
The second drawback to this approach is its disk-intensive nature. Every time data is inserted or updated in the database, it passes through several OSI layers, which adds resource overhead and stresses the system. Disk IO processes are expensive and generally slow down a system. To be sure, writing to the database is required in the loading phase, and often business requirements dictate that some interim database activities are necessary. The point, however, is that an architecture with data staging at its center has processing overhead that limits scalability. In fact, every step in the package from source to destination requires disk IO, including the set-based update statements, which at times may be more IO-intensive than pulling and writing data between data staging tables.
Related to the disk-intensive nature of this approach is the impact on the RDBMS when a process uses it for data correlation, lookups, and cleansing. This may or may not be relevant in every solution or even in the different layers of an ETL design, but consider a design where foreign keys are added to a staging table or queried in a custom source SQL statement through a series of database joins. Since the keys reside in a production destination database, even if the SQL statement is distributed across servers, there is a hit on the production system when that portion of the package runs. When the volume of the related destination tables starts to scale up, and if indexes are not optimized for loading, the load can really stress the RDBMS resources-processor, RAM, and IO. Whether the primary purpose of the destination database is to perform queries, report, or handle day-to-day transactions, this impact may affect the end users of your system. It may reduce productivity or, even worse, create a negative perception among end users of the value of the system.
These concerns notwithstanding, good design is possible using DTS, but the challenge comes with the required time, effort, and ongoing support. These DTS-based processes have also provided you with valuable lessons to apply when developing solutions with Integration Services. Finally, DTS may be perfectly appropriate for some systems; however, by examining the common pitfalls of DTS-based solutions, you will be better able to determine if DTS is a viable solution.
This Is Now: Integration Services
So here you are — a new tool and the opportunity to reconsider your data-processing architecture. In an ideal world, the tool would not be part of your conceptual design. This may never ultimately be achievable, but with Integration Services, solutions can be designed with much more flexibility and extensibility given the nature of the product. Designing a data-processing solution requires more than just sending the source data into a black-box transformation engine with outputs that push the data into the destination. And of course, system requirements will dictate the final design of the process, including but not limited to the following:
- Source and destination system impact
- Processing time windows and performance
- Destination system state consistency
- Hard and soft exception handling and restartability needs
- Environment architecture model, distributed hardware, or scaled-up servers
- Solution architecture requirements such as flexibility of change or OEM targeted solutions
- Modular and configurable solution needs
- Manageability and administration requirements
In reviewing this list, you can quickly map several of these to what you have learned about Integration Services already. In most cases, a good architecture will leverage the built-in functionality of the tool, which in the end reduces administration and support requirements. The tool selection process, if it is not completed before a solution is developed, should include a consideration of the system requirements and functionality of the available products.
The focus of this article is not to provide an exhaustive Integration Services architecture but rather to provide a framework for design that models true ETL (in that order). Integration Services brings with it a platform that is able to meet a broad variety of business data-processing needs and handle the required volume and integration complexity within the confines of the engine itself.
Integration Services Design Practices
In the previous analysis of DTS, you looked at a few challenges to the architecture. In doing so, however, you've uncovered a few good principles to follow when designing an Integration Services solution:
- Limit synchronicity
- Reduce staging and disk IO
- Reducing the reliance on an RDBMS
You've already looked at a few reasons why these are important principles to consider, and the result of following these will be more apparent as you compare your DTS solution to an Integration Services approach.
Keep in mind that solution requirements often drive design decisions, and there are situations where staging or the RDBMS are useful in data processing. Some of these are discussed in this section. Your goal, though, is to rethink your design paradigms with Integration Services.
Leveraging the Data Flow
For sure, the biggest value that Integration Services bring is the power of the Data Flow. Not to minimize the out-of-the-box functionality of restartability, configurations, logging, Event Handlers, or other Control Flow tasks, the primary goal of the engine is to "integrate," and the Data Flow is the key to realizing that goal. Accomplishing data-processing logic through Data Flow transformations brings performance and flexibility.
Most data architects come from DBA backgrounds, which means that the first thing that comes to their minds when trying to solve a data integration, processing, or cleansing scenario is to use an RDBMS, such as SQL Server. People gravitate to areas they are comfortable with, so this is a natural response. When your comfort in SQL is combined with an easy-to-use and low-cost product like DTS, which in many ways relies on relational databases, the result is a widely adopted tool.
Moving to Integration Services in some ways requires thinking in different terms — Data Flow terms. The focus in this section will be on applying some of those components into design decisions and translating the SQL-based designs into Data Flow processes.
The three architecture best practices relate directly to the value that the Data Flow provides:
- Limit synchronicity. By bringing in more of the processing logic into the Data Flow, the natural result is fewer process-oriented steps that require completion before moving on. The general streaming nature of the Data Flow (streaming in terms of moving data, not audio or video streaming) translates to reduced overall processing times.
- Reduce staging and expensive IO operations. The Data Flow performs most operations in memory (with occasional use of temp folders and some interaction with external systems). Whenever processing happens on data that resides in RAM, processing is more efficient. Disk IO operations rely on the performance of the drives, the throughput of the IO channels, and the overhead of the operating system to write and read information to the disk. With high volumes or bursting scenarios typical with data processing and ETL, disk IO is often a bottleneck.
- Reduce reliance on RDBMS. Relational engines are powerful tools to use, and the point here is not to detract from their appropriate uses to store and manage data. By using the Data Flow to cleanse and join data rather than the RDBMS, the result is reduced impact on the relational system, which frees it up for other functions that may be higher priority. Reading data from a database is generally less expensive than performing complex joins or complicated queries. In addition, related to the first bullet, all RDBMS operations are synchronous. Set-based operations, while they are very useful and optimized in a relational database system, still require that the operation be complete before the data is available for other purposes. The Data Flow, on the other hand, can process joins and lookups and other cleansing steps in parallel while the data is flowing through the pipeline.
Data Integration and Correlation
The Data Flow provides the means to combine data from different source objects completely independent of the connection source where the data originates. The most obvious benefit of this is the ability to perform in-memory correlation operations against heterogeneous data without having to stage the data. Said in another way, with Integration Services, you can extract data from a flat file and join it to data from a database table inside the Data Flow without first having to stage the flat file to a table and then perform a SQL join operation. This can be valuable even when the data is coming from the same source, such as a relational database engine; source data extractions are more efficient without complex or expensive joins, and data can usually begin to flow into the Data Flow immediately. In addition, single table
SELECT statements provide less impact to the source systems than do pulls where join logic is applied. Certainly there are situations where joining data in the source system may be useful and efficient; in many cases, however, focusing on data integration within the Data Flow will yield better performance. When different source systems are involved, the requirement to stage the data is reduced.
Several of the built-in transformations can perform data correlation similar to how a database would handle joins and other more complex data relationship logic. The following transformations provide data association for more than one data source:
- Merge Join
- Union All
- Fuzzy Lookup
- Term Lookup
- Term Extract
Beyond the built-in capabilities of Integration Services, custom adapters and transformations allow more complex or unique scenarios to be handled. The two most commonly used and powerful data correlation transformations are the Lookup and Merge Join.