Building an ETL pipeline: Best practices

Data transformation is central to data quality: it turns raw data from multiple sources into high-quality information that business decision-makers can mine for insights.

At the heart of the data transformation process is the data pipeline. Most data engineers are managing and maintaining dozens of such pipelines across dozens or more data sources. Running them with high performance and high quality requires a set of best practices for managing change, facilitating collaboration, and pushing updates to production safely.

The traditional approach to implementing data pipelines is ETL—Extract, Transform, and Load. We’ll compare ETL with its modern successor, ELT (Extract, Load, Transform), and walk through the key best practices you should implement to ensure high-quality data transformations.

ETL vs. ELT: Identifying the “best” workflow for data

ETL and ELT are sometimes used interchangeably these days. Historically, however, they take meaningfully different approaches to working with data.

Both processes take raw data from one or more sources and apply a series of data transformations to create a new dataset in a target system. The difference lies in when the transformation occurs:

  • ETL transforms data, often semi-structured or unstructured, before loading it into the target system
  • ELT, by contrast, loads the raw data, often from other relational or data warehousing systems, into the target system and then transforms it in-place
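To make the difference concrete, here’s a minimal SQL sketch of the ELT pattern. It assumes a hypothetical raw.orders table that an ingestion tool has already copied into the warehouse unchanged:

    -- Transform in place: build a cleaned table from the already-loaded raw table.
    -- In an ETL workflow, this reshaping would happen in a separate tool before
    -- the data ever reached the warehouse.
    create table analytics.stg_orders as
    select
        order_id,
        cast(order_date as date) as order_date,
        lower(trim(status))      as order_status,
        amount_cents / 100.0     as amount_usd
    from raw.orders              -- loaded as-is by the ingestion step
    where order_id is not null;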

For the most part, in a cloud environment, you’ll want to use an ELT workflow, which is what most organizations are doing today. There are a few reasons for this:

  • Scalability
  • Flexibility
  • Cost efficiency
  • Data democratization

Scalability

Compared to ETL, ELT scales better because it takes advantage of the processing power of cloud-native systems like Snowflake and Redshift. Once data is loaded, these systems can transform it efficiently at scale.

ELT also generally handles large, complex datasets better and faster than ETL. With ETL, the larger the dataset becomes, the harder it is to manage, because the data must be fully transformed before it can be stored in its final form.

Flexibility

In ETL, the source data is transformed en route, and only the transformed version ends up in the data warehouse. This can lead to a couple of issues:

  • If the original data isn’t preserved anywhere, it’s lost forever. That can have severe implications for compliance and data governance, especially if errors are discovered in the pipeline.
  • The new data format might not be the ideal format for all teams.

ELT preserves more flexibility and accountability by loading the original data first. This guarantees that you always have a copy of the original. It also lets different teams reuse the same raw data for their own use cases.
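As an illustration, because the raw table is still in the warehouse, different teams can derive their own views from it without re-ingesting anything. The schema and column names below are hypothetical:

    -- Finance wants daily revenue
    create view finance.orders_by_day as
    select
        cast(order_date as date)  as order_date,
        sum(amount_cents) / 100.0 as revenue_usd
    from raw.orders
    group by 1;

    -- Marketing wants order volume by acquisition channel
    create view marketing.orders_by_channel as
    select
        channel,
        count(*) as order_count
    from raw.orders
    group by 1;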

Cost efficiency

ETL has historically required specialized hardware and software to implement. By contrast, ELT makes use of the native capabilities of cloud systems, standing up compute resources (like virtual servers) when transformation work is needed and shutting them down immediately afterwards.

Data democratization

In ETL, no one except the data engineering team typically has access to the original data until it gets loaded into the data warehouse in its final form.

In ELT, the original data is immediately available in the data warehouse. This enables other teams to discover it and build their own data pipelines immediately, without waiting on the centralized data engineering team to act on a support ticket.

Best practices for ETL and ELT pipelines

For the most part, then, a key best practice for data pipelines is: use ELT. There are, however, a few exceptions where ETL still makes sense:

  • You’re dealing with a small dataset that requires complex transformations
  • You have to integrate with a legacy or third-party system that works better with an ETL approach
  • You’re handling scenarios such as Internet of Things (IoT) data, where you need to combine data arriving in different formats, or de-duplicate entries, before storing it

Whether you use ETL or ELT, however, there’s a set of best practices you can follow to improve the quality, reliability, and performance of your data transformations.

Have a workflow process in place

Many data pipelines—even today—are driven primarily by engineering and managed on an ad hoc basis. There are few processes in place to establish SLAs, version pipelines, and respond to data pipeline issues. That leads to:

  • Data that diverges from business requirements
  • Untested (or undertested) data changes pushed to production
  • Broken pipelines and outdated reports that lead to impaired trust in data
  • Slow decision-making velocity

We originally created dbt to address what we saw as deficiencies in the data workflow process. We sought to bring some of the best practices from software engineering (version control, deployment pipelines, testing, and documentation) into the data world.

Today, we’ve expanded that mission by championing what we call the Analytics Development Lifecycle (ADLC). Like its counterpart, the Software Development Lifecycle (SDLC), the ADLC aims to break down the barriers between data team members—data engineers, analysts, and business decision-makers. It ensures each new data pipeline or pipeline improvement aligns with core business objectives and has measurable key performance indicators (KPIs).

From a quality standpoint, the ADLC focuses on creating small, well-tested changes that engineers promote to production with high velocity. Once changes are in production, they’re monitored, operationalized for performance, and made available for analysts and business users to discover in a self-service manner.

Use version control and enable collaboration

A major source of errors in ETL and ELT pipelines is the lack of a central repository for a data pipeline’s source code. At a minimum, all pipelines should be tracked and versioned using a version control system. That guarantees that all data changes are represented as code and can be discovered, changed, or even reverted as needed.

Besides tracking changes, version control enables collaboration among data engineers, analytics engineers, and other technically savvy data users. It provides a single source of truth that people can discover, modify, and submit for approval. That allows more people across the company to contribute to building and maintaining data pipelines.

Build a culture of quality around data workflows

According to Monte Carlo, half of all companies surveyed said that 25% of their revenue is impacted by poor data quality. That speaks to a need to build quality controls and measures into every part of the ADLC.

Part of building a culture of quality means that engineers should be creating tests for all new data pipelines and for any changes they implement. Equally important, however, is implementing a version control-based process for enforcing quality checkpoints:

  • Engineers should work in their own source control branches, separate from the “main” branch containing production code
  • After testing their changes locally, engineers create a pull request (PR) to push code into the main branch
  • The PR triggers a manual review of their code, as well as an automatic run of any tests they’ve written
  • Once approved, a Continuous Integration/Continuous Deployment (CI/CD) process further tests and deploys their changes automatically to production
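In dbt, for example, one lightweight way to codify such a quality check is a singular test: a SQL file in the tests/ directory that passes when it returns zero rows. A sketch, using a hypothetical orders model and amount column:

    -- tests/assert_no_negative_order_amounts.sql
    -- Fails the build if any order has a negative amount.
    select
        order_id,
        amount_usd
    from {{ ref('orders') }}
    where amount_usd < 0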

Keep separate environments for each part of the dev process

No one should be developing changes against production data. At a minimum, data pipeline development should use three separate environments:

  • One or more dev environments that engineers can test their changes against locally
  • A staging environment where the CI/CD pipeline runs all tests before deploying to production
  • The production environment, against which only fully tested and approved changes run
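In dbt, one common way to keep these environments apart is to route builds into separate schemas based on the active target. The sketch below overrides dbt’s generate_schema_name macro and assumes target names of prod, ci, and dev defined in profiles.yml:

    -- macros/generate_schema_name.sql
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- set base = custom_schema_name if custom_schema_name is not none else target.schema -%}
        {%- if target.name == 'prod' -%}
            {{ base | trim }}
        {%- elif target.name == 'ci' -%}
            ci_{{ base | trim }}
        {%- else -%}
            {# dev builds land in the developer's personal schema from profiles.yml #}
            {{ target.schema | trim }}
        {%- endif -%}
    {%- endmacro %}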

Develop and maintain a style guide

Part of enabling collaboration on data pipelines means creating code that’s readable and understandable by others. A style guide brings clarity and consistency to all data pipeline code by laying out a set of conventions that all engineers should follow.
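What that looks like in practice varies by team, but here’s an illustrative dbt model written the way many SQL style guides suggest: lowercase keywords, one column per line, and “import” CTEs at the top. The model and column names are hypothetical:

    with

    orders as (
        select * from {{ ref('stg_orders') }}
    ),

    customers as (
        select * from {{ ref('stg_customers') }}
    ),

    final as (
        select
            orders.order_id,
            orders.order_date,
            customers.customer_name,
            orders.amount_usd
        from orders
        left join customers
            on orders.customer_id = customers.customer_id
    )

    select * from final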

Document all workflows

Most data pipelines go undocumented, which makes it harder for others to understand:

  • How to modify them
  • How to use the resulting data
  • Where the data was derived from and how it was calculated

Documentation enables pipeline maintainers to understand how and why certain decisions in the code were made. It also makes it easier for data consumers to have confidence in the resulting datasets, which increases overall data trust.

Monitor data pipelines in production

The Operate and Observe phase of the ADLC acknowledges that a data pipeline isn’t “done” just because the code has shipped. Monitoring ensures that new data pipeline code is running smoothly: it tests in production, reports errors, and recovers gracefully from failures. It also verifies that the code meets defined performance metrics and answers users’ requests in a timely fashion.
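A simple example of this kind of production check, expressed as a dbt singular test: fail the run if no new rows have landed recently. The model name, column, and 24-hour threshold are assumptions for illustration; dbt’s built-in source freshness checks cover similar ground declaratively:

    -- tests/assert_orders_loaded_recently.sql
    -- Returns a row (and therefore fails) only if the newest record is stale.
    select
        max(loaded_at) as latest_load
    from {{ ref('stg_orders') }}
    having max(loaded_at) < current_timestamp - interval '24 hours'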

Optimize your pipelines based on your platform

All of the above advice applies to any approach to crafting ETL and ELT pipelines, regardless of the language or tools you use. Your toolset will usually support additional features that make developing and shipping revisions to data transformations faster.

In dbt, for example, you create new data transformations by creating data models. Within dbt, you can optimize your pipelines for better performance and lower cost in a number of ways. One of the most common is materializing large models incrementally, so that scheduled runs process only new or changed records instead of rebuilding entire tables from scratch.
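Here’s a minimal sketch of that incremental pattern, with hypothetical model and column names:

    -- models/fct_events.sql
    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        user_id,
        event_type,
        event_timestamp
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
      -- applied only on incremental runs, not on full refreshes
      where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}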

Enabling collaboration on ETL and ELT pipelines at scale

Building out the infrastructure to support these processes for ETL and ELT pipelines from the ground up requires significant engineering investment. This is made even more challenging by the heterogeneous nature of most corporate data environments, which may be using hundreds of different data sources.

dbt Cloud can serve as your company’s data control plane for managing ETL and ELT pipelines across the enterprise:

  • It’s natively interoperable across various cloud and data platforms, so you’re never locked in
  • Its platform features support data developers and their stakeholders across every stage of the Analytics Development Lifecycle, making data analytics a team sport
  • It provides the trust signals and observability features required to ensure all data outputs are accurate, governed, and trustworthy

To learn more about managing data pipelines at scale with dbt Cloud, ask us for a demo today.
