Data transformation and ETL
“Data transformation” and “ETL” are two terms you’ll come across as a data professional. But what do they mean, and what’s the difference between them?
In fact, they aren’t opposing ideas; one is a part of the other. Data transformation is the “T” (“Transform”) in ETL (Extract, Transform, Load). With the introduction of the cloud and the ensuing rapid evolution of the data ecosystem, ETL too has evolved, and a new approach has emerged: ELT (Extract, Load, Transform).
In this article, we’ll look at ETL, its relationship to data transformation, and how that relationship is changing for the better with the shift to ELT.
What is data transformation?
Data transformation is the process engineers use to take data from one state to another. Engineers use a query language like SQL, or a transformation library in a general-purpose language like Python, to write code that takes one table of data and turns it into another table, a set of tables, or views on the original table.
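For example, a minimal SQL sketch of a transformation might look like the following. The raw_orders table and its column names are hypothetical, but the pattern is typical: cast text fields to proper types, filter out unwanted rows, and roll the result up into a new table.

```sql
-- A minimal sketch of a transformation (table and column names are hypothetical):
-- cast text fields to proper types, drop cancelled orders, and aggregate
-- to one summary row per customer.
create table customer_order_summary as
select
    customer_id,
    count(*)                          as order_count,
    sum(cast(order_total as numeric)) as lifetime_revenue,
    min(cast(ordered_at as date))     as first_order_date
from raw_orders
where status <> 'cancelled'
group by customer_id;
```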
Data transformation is the heart of data engineering. It allows engineers to clean, normalize, and standardize data to use it for further analysis. Engineers use it to take prepared data and turn it into meaningful aggregates and analytics. Every step of a data processing pipeline is built with data transformation.
Data transformation turns raw data into meaningful analytics useful for decision-making. It establishes a quantitative groundwork for understanding an organization's state. It’s the mechanism that turns data into information.
How data transformation fits into ETL
ETL stands for “Extract, Transform, Load,” and it outlines a sequential pipeline for data processing. Data transformation is the second step of this process. An ETL pipeline for an analytics dashboard is:
- Extract: A user system connects to a data store and pulls the data into memory on a compute instance (e.g., a virtual machine)
- Transform: The user reformats the data in memory, e.g., by converting strings into integer values or ensuring a customer ID field uses a consistent format across tables (see the SQL sketch after this list)
- Load: The user stores the transformed data in a new location—typically an OLAP database—where a downstream process (e.g., an analytics dashboard) can aggregate and use it to drive business decision-making
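To make the Transform step concrete, here is a hedged SQL sketch of the kind of reshaping described above. The extracted_orders table and its columns are hypothetical; the point is the work being done: casting string columns to numeric and timestamp types and normalizing customer IDs so they match other tables.

```sql
-- Hypothetical Transform step from the ETL pipeline above:
-- cast string columns to proper types and standardize customer IDs
-- so they match the format used in other tables.
select
    upper(trim(customer_id))           as customer_id,  -- e.g. ' ab123 ' -> 'AB123'
    cast(quantity as integer)          as quantity,     -- string -> integer
    cast(unit_price as numeric(10, 2)) as unit_price,   -- string -> decimal
    cast(ordered_at as timestamp)      as ordered_at
from extracted_orders;
```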
ETL is an architecture that was designed decades ago, when client-side storage and bandwidth were limited and costly. All of the data stays in the source database, and only select data is loaded into a central repository after it has been transformed. A single transformed table is much smaller than an entire slice of a database, so linking these steps together makes sense when you want to limit how much data moves across your network and sits on hard drives.
But that efficiency comes at a price. In a modern environment, ETL has serious flaws that get in the way of scaling data operations.
The overall problem is inflexibility. Every transformation is defined before data is brought in, so each project is constructed ad hoc, even when it uses transformations similar to another project's. Similar cleaning and preparation processes are repeated across the organization.
Different teams in different branches of the organization develop different metrics when building their dashboards, leading to conflicts in understanding. This chaotic development stack leads to inefficient workstreams: every pipeline is built from the ground up, and every project that wants to incorporate data must start from scratch and bring significant data engineering expertise.
ETL also increases the time it takes to deliver value. In an ETL process, data consumers don’t have access to the raw underlying data. They have to wait on the ETL process to deliver it to the data warehouse. That means they have to communicate their precise business needs to the data engineering team in detail. This drawn-out process means it can take anywhere from days to months for business analysts and decision-makers to get the data they need.
How ELT improves transformation workflows
ELT (Extract, Load, Transform) is a newer architectural approach that has emerged as data systems have moved onto the cloud. In this paradigm, data transformation is the final step of the process and happens after data has been loaded into the centralized data store.
An ELT pipeline takes the following steps:
- Extract: An analytics engineer connects to a database of raw data
- Load: The engineer brings relevant data segments into a data platform as is.
- Transform: The analytics engineer builds a pipeline of transformations in the warehouse on top of the loaded data. These pipelines clean data, prepare it for further analysis, and establish baseline metrics. Data analytics teams build additional pipelines on these transformed tables, producing the final deliverable (a dashboard, an ML or AI application, etc.).
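As a sketch of that final Transform step, the SQL below shows the kind of layered transformation an analytics engineer might build directly in the warehouse. The raw.app.orders source and all column names are hypothetical; the idea is that cleaning happens on top of data that is already loaded, and later models build on the cleaned layer.

```sql
-- Layer 1: a cleaning/staging transformation over raw, already-loaded data
-- (the raw.app.orders source and column names are hypothetical).
create view stg_orders as
select
    upper(trim(customer_id))      as customer_id,
    cast(order_total as numeric)  as order_total,
    cast(ordered_at as timestamp) as ordered_at
from raw.app.orders;

-- Layer 2: a metric built on the cleaned layer, ready for a dashboard.
create table fct_daily_revenue as
select
    cast(ordered_at as date) as order_date,
    sum(order_total)         as daily_revenue
from stg_orders
group by 1;
```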
In ELT, transformation happens directly inside a data platform, allowing for centralized, organized data processing pipelines. Engineers can pull interesting data from data sources into the data platform, and because cloud storage is so cheap, they can load any data they think might be useful for analytics. Once loaded, data teams clean (“transform”) the data and create ready-to-use tables for any relevant analysis.
Analytics pipelines for one project can be accessed and reused by teams doing similar tasks. Engineers can build standardized metrics so that any team that uses them starts from a single source of truth. A framework of engineered tables to build on allows teams to hook into data without starting from a raw state, lowering the knowledge and effort barriers for all sorts of projects.
Supporting ELT with dbt data transformation
Organizations need tools to support the potential of an ELT architecture. That’s where dbt comes in.
dbt is a SQL-first data transformation workflow that allows teams to quickly and collaboratively deploy analytics code using software development best practices. This gives data teams the agency, control, and visibility needed to deliver world-class data products at scale and promotes data quality and trust.
dbt Cloud delivers dbt as a service with the critical security, governance, automation, and collaboration features required for managing data complexity at scale.
dbt also automatically tracks and displays the dependencies between different transformed tables. That means engineers don’t have to dig through documentation trying to decipher which table depends on which.
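For instance, a dbt model declares its upstream dependencies with the ref() function, and dbt uses those references to assemble the dependency graph and lineage automatically. The model and file names below are hypothetical.

```sql
-- models/fct_daily_revenue.sql (hypothetical model name)
-- Referencing stg_orders via ref() tells dbt that this model depends on it,
-- so the relationship appears in dbt's lineage graph automatically.
select
    cast(ordered_at as date) as order_date,
    sum(order_total)         as daily_revenue
from {{ ref('stg_orders') }}
group by 1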
dbt also promotes data quality and accuracy by allowing users to configure tests that are automatically applied to new datasets. Engineers can define their tests and be confident that their data pipelines meet the required standards to build trust in their outputs. dbt’s debugging capabilities automatically check developed pipelines for errors, highlighting where things have broken down so that engineers know where to look when things go wrong.
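As one illustration, a singular test in dbt is a SQL file in the project's tests directory that selects the rows violating an expectation; if the query returns any rows, the test fails. The model and column names here are hypothetical.

```sql
-- tests/assert_order_totals_not_negative.sql (hypothetical test name)
-- dbt runs this query during `dbt test`; any returned rows count as failures.
select
    order_id,
    order_total
from {{ ref('stg_orders') }}
where order_total < 0
```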
All of these features add up to a system designed to maximize the benefits of an ELT architecture. Using dbt, engineers can easily organize their efforts, collaborate, share their work, and automate repetitive workflows. The support dbt provides opens up more engineering time to develop new metrics, optimize pipeline efficiency, and explore datasets to find new valuable information.
Faster time to value with dbt
Data transformation is the core of ETL and ELT data processing pipelines. As data systems have shifted to the cloud, ELT has become the standard thanks to the structure it can provide for data engineering projects. dbt offers tools like code-based transformation logic, pipeline diagrams, and automation to maximize the potential of ELT pipelines.
Want to learn more about how dbt supports ELT? Check out this guide. Want to see if dbt is right for your organization? Create an account and book a demo today.