
Understanding data transformation frameworks

It takes work and skill to turn raw data into business insights. Data transformation is an indispensable part of this process: data engineers use it to clean, enrich, integrate, and store data from multiple sources in a more accessible format.

Doing data transformation well, however, requires more than just writing a bit of code. Low-quality data transformation code results in low-quality data—which, by some estimates, costs organizations almost $13 million every year.

This is why any company doing data transformation at scale needs a data transformation framework. A framework provides tools that bring consistency to the data transformation process, enabling data stakeholders to create, track, verify, deploy, and monitor code across the analytics lifecycle.

We’ll take a deeper look at what a data transformation framework is, why you need one, and how to build one with minimal upfront investment.

What is a data transformation framework?

A data transformation framework is a system of processes and tools for managing data transformation pipelines across your organization, so you can ship new or revised analytics code with both high quality and high velocity.

From a purely technical standpoint, teams can create data transformations in a number of languages and run them in a number of ways:

  • One group of data engineers may write scripts in Python and run them as cron jobs on a virtual server somewhere
  • Another might write SQL-based transformations, bundle them into a Docker container, and run them on their preferred cloud provider’s container orchestration platform
  • Yet another might build this code out as stored procedures in their data warehouse

The problem with this scattershot approach is that it doesn't provide any centralized visibility or management of a critical corporate asset. Over time, this results in:

  • Lack of traceability of analytics code changes, as some developers may not be using source code control to manage changes
  • Lack of code reuse, which forces multiple teams to solve the same problem multiple times in multiple different ways
  • Lack of consistency as teams take inconsistent approaches to transforming tables and fields, resulting in discrepancies and errors that take time to track down and fix
  • Lack of quality as some engineers run analytics code changes in production with little—and in some cases, no—testing

Companies don't end up in this spot for no reason. They get here because they have to manage data in a heterogeneous collection of data sources spread across multiple groups and divisions around the globe.

A data transformation framework solves for this by providing a centralized, common, and vendor-agnostic approach to writing, testing, deploying, and operationalizing data analytics code. It works across all of the major data sources and destinations used in today's modern data stacks to provide a consistent approach to creating data pipelines no matter where data lives.

Components of a data transformation framework

A data transformation framework has two fundamental components, one focused on methodology and the other on tools:

  • Data lifecycle management process
  • Data control plane

Let's look at each one of these in detail.

Data lifecycle management process

Many analytics code changes are still performed haphazardly and driven by engineering resources. This results in variable quality of data transformation code across the organization. It also risks data transformation work becoming disconnected from business goals.

In addition to a common toolset, companies need a common process. A mature analytics data lifecycle involves all data stakeholders from the get-go and focuses on shipping small, high-quality releases.

At dbt, we refer to this process as the Analytics Development Lifecycle (ADLC). The ADLC is a vendor-agnostic approach to managing analytics code changes in which all data stakeholders—engineers, analysts, and business decision-makers—work together to ensure the success of each release.

The ADLC defines the following stages and processes for creating high-quality data transformations:

Plan. Gets all stakeholders on the same page, validates the business case, anticipates downstream impacts, sets out security guidelines, and defines success for the project.

Develop. Represents all analytics changes as source code governed by version control, invests in code quality (e.g., by creating DRY analytics code as well as reusable code modules), and institutes a review process to ensure high-quality releases.

Test. Creates unit and data tests to verify data transformation logic at multiple points during the development and release process (a sketch of a data test appears after this list).

Deploy. Uses automated processes to test and verify all changes in pre-production environments before making them live for customers.

Operate and observe. Tests changes continuously in production, ensures the system tolerates and recovers from failure, and uses metrics to continuously measure performance and data quality.

Discover and analyze. Enables analysts and business users to discover the data created by data transformations in a self-service manner, verify its data lineage, and provide feedback.
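
At its simplest, a data test is a query that returns any rows violating an expectation, so an empty result means the test passes. Here's a minimal sketch, assuming a hypothetical analytics.orders table:

-- Data test sketch: every order must reference a customer,
-- and no order can be dated in the future.
-- Any rows returned by this query count as failures.
select
    order_id,
    customer_id,
    order_date

from analytics.orders  -- hypothetical table name

where customer_id is null
   or order_date > current_date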

Data control plane

While the ADLC provides a process, it doesn't provide a toolset for implementing it. Processes such as code review, testing, and deployment of changes are easier if you have tools that can work uniformly across all the data stores in your enterprise. Without this, your teams might take varying approaches to implementing data transformation logic that lock you into a number of vendor-specific solutions and make practices such as code reuse more difficult.

The next piece of the puzzle in your data transformation framework is a data control plane. The data control plane is an abstraction layer that sits across your data stack and provides unified capabilities for data transformation, orchestration, observability, and more. It also centralizes metadata across your business, giving you a bird's eye view of everything happening across your data estate.

A data control plane provides three major benefits:

Avoid vendor lock-in. Provides a method for modeling and transforming data whether it’s in Snowflake, Databricks, Amazon Redshift, Fivetran, Airbyte, or elsewhere. A common data modeling platform means you can more easily re-share code and data modeling expertise across use cases.

Make analytics a team sport, with guardrails. Without a common data transformation framework, you can easily end up with complex data pipelines that only data engineers can run and modify. By utilizing a common modeling solution that leverages transformation languages, such as SQL and Python, you can open data transformation work to a larger number of stakeholders, including analysts and even technically savvy decision-makers. Testing, code reviews, and other quality processes provide a set of guardrails to verify code from contributors before any changes go to production.

Promote data quality and trust. A good data control plane can also generate documentation as well as data lineage for all data models. This means business stakeholders can discover and verify the origin, quality, and purpose of data in a self-service manner. As a result, stakeholders make greater use of data, which leads to better and more timely decision-making.

How a data transformation framework works in practice

A good data transformation framework provides a data control plane for all of the data transformation logic and metrics in your business, one that supports and reinforces the practices codified by the ADLC.

dbt Cloud is one such data control plane, built from the ground up to manage trusted data at scale across your enterprise. Let's walk through how dbt Cloud implements the various aspects of a data transformation framework. This walkthrough assumes you're using Snowflake, but you can also use a number of other popular data sources and destinations.

After connecting to a data source, such as Snowflake, a data engineer can create a dbt project and associated Git repository. This places all of their code under source control, which enables traceability of changes and will also power their analytics code deployment processes. Anyone who wants to contribute to the project can work on different Git branches, keeping their changes isolated from production code (and other developers) until they’re good enough to ship.

Next, the engineer defines dbt models, which express data transformation logic in either SQL or Python. For example, the following model combines data from the customers and orders tables to create a unified customer data set with per-customer order metrics attached. This is where they implement their data transformation logic, including any cleansing, aggregation, and other common data transformations they need to perform.

-- customers: rename columns from the raw customers table
with customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from raw.jaffle_shop.customers

),

-- orders: rename columns from the raw orders table
orders as (

    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status

    from raw.jaffle_shop.orders

),

-- customer_orders: aggregate orders to one row per customer
customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

-- final: attach each customer's order metrics, defaulting to zero orders
final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final

Using dbt, data developers can create complex models, even importing other projects and building models on top of other models. They can also create tests and documentation alongside these models.
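
For instance, if the model above were saved as dim_customers.sql (a file name chosen here purely for illustration), a downstream model could build on it using dbt's ref() function, which both resolves to wherever that model is materialized and records the dependency in the project's lineage:

-- Hypothetical downstream model: repeat customers only.
select
    customer_id,
    first_name,
    last_name,
    number_of_orders

from {{ ref('dim_customers') }}

where number_of_orders >= 2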

When their changes are ready, the engineer commits them and opens a pull request (PR) to ask for a code review. Your team can set up dbt Cloud so that the PR also triggers a run of all tests associated with the engineer's changes, providing another level of quality control.
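
Those tests can include singular data tests, which are SQL files checked into the project's tests directory and fail whenever they return rows. As a sketch, reusing the hypothetical dim_customers name from above:

-- Singular data test: a customer's most recent order can never
-- predate their first order. Returned rows are reported as failures.
select
    customer_id,
    first_order_date,
    most_recent_order_date

from {{ ref('dim_customers') }}

where most_recent_order_date < first_order_date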

Using dbt Cloud, your team can also set up Continuous Integration (CI) jobs to further test changes before promoting them to production. Once a PR is approved, it can trigger a CI job that tests the changes against mock data in a non-production environment. This provides added confidence that the engineer's changes will work as expected in prod.

Once the changes are live, analysts and decision-makers can discover the data via dbt Explorer, using the associated documentation and data lineage to guide their usage. They can then provide feedback on the finished product to drive the next iteration of the ADLC—e.g., by identifying further opportunities for data aggregation or enrichment.

Implement your end-to-end data transformation framework today

A data transformation framework consists of both processes and tools. There’s little you can do to speed up the process component; it takes hard work and time to change old habits. However, using a platform like dbt Cloud, you can implement a data control plane in far less time than it would take you to build your own from scratch.

With dbt Cloud as your data control plane, your data teams have a standardized and cost-efficient way to build, test, deploy, and discover analytics code while data consumers have purpose-built interfaces and integrations to self-serve data that is governed and actionable. To see how it can work at your company, ask us for a demo today.


