Building an ETL pipeline: Best practices

Data transformation is central to data quality: it turns raw data from multiple sources into high-quality information that business decision-makers can mine for insights.

At the heart of the data transformation process is the data pipeline. Most data engineers are managing and maintaining dozens of such pipelines across dozens or more data sources. Running them with high performance and high quality requires a set of best practices for managing change, facilitating collaboration, and pushing updates to production safely.

The traditional approach to implementing data pipelines is ETL—Extract, Transform, and Load. We’ll compare ETL with its modern successor, ELT (Extract, Load, Transform), and walk through the key best practices you should implement to ensure high-quality data transformations.

ETL vs. ELT: Identifying the “best” workflow for data

ETL and ELT are sometimes used interchangeably these days. Historically, however, they take meaningfully different approaches to working with data.

Both processes take raw data from one or more sources and apply a series of data transformations to create a new dataset in a target system. The difference lies in when the transformation occurs:

  • ETL transforms data, often semi-structured or unstructured, before loading it into the target system
  • ELT, by contrast, loads the raw data, often from other relational or data warehousing systems, into the target system and then transforms it in-place
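To make the difference concrete, here’s a minimal SQL sketch of the ELT pattern. It assumes a hypothetical raw.orders table that an ingestion tool has already copied into the warehouse unchanged:

    -- Transform in place: build a cleaned table from the already-loaded raw table.
    -- In an ETL workflow, this reshaping would happen in a separate tool before
    -- the data ever reached the warehouse.
    create table analytics.stg_orders as
    select
        order_id,
        cast(order_date as date) as order_date,
        lower(trim(status))      as order_status,
        amount_cents / 100.0     as amount_usd
    from raw.orders              -- loaded as-is by the ingestion step
    where order_id is not null;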

For the most part, in a cloud environment, you’ll want to use an ELT workflow, which is what most organizations are doing today. There are a few reasons for this:

  • Scalability
  • Flexibility
  • Cost efficiency
  • Data democratization

Scalability

Compared to ETL, ELT scales better because it takes advantage of the processing power of cloud-native systems like Snowflake and Redshift. Once data is loaded, these systems can transform it efficiently at scale.

ELT also generally handles large, complex datasets better and faster than ETL. With ETL, the larger the dataset becomes, the harder it is to manage, because the data must be fully transformed before it can be stored in its final form.

Flexibility

In ETL, the source data is transformed en route, and only the transformed version ends up in the data warehouse. This can lead to a couple of issues:

  • If the original data isn’t preserved anywhere, it’s lost forever. That can have severe implications for compliance and data governance, especially if errors are discovered in the pipeline.
  • The new data format might not be the ideal format for all teams.

ELT preserves more flexibility and accountability by loading the original data first. This guarantees that you always have a copy of the original. It also lets different teams reuse the same raw data for their own use cases.
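As an illustration, because the raw table is still in the warehouse, different teams can derive their own views from it without re-ingesting anything. The schema and column names below are hypothetical:

    -- Finance wants daily revenue
    create view finance.orders_by_day as
    select
        cast(order_date as date)  as order_date,
        sum(amount_cents) / 100.0 as revenue_usd
    from raw.orders
    group by 1;

    -- Marketing wants order volume by acquisition channel
    create view marketing.orders_by_channel as
    select
        channel,
        count(*) as order_count
    from raw.orders
    group by 1;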

Cost efficiency

ETL has historically required specialized hardware and software to implement. By contrast, ELT makes use of the native capabilities of cloud systems, standing up compute resources (like virtual servers) when transformation work is needed and shutting them down immediately afterwards.

Data democratization

In ETL, no one except the data engineering team typically has access to the original data until it gets loaded into the data warehouse in its final form.

In ELT, the original data is immediately available in the data warehouse. This enables other teams to discover it and build their own data pipelines immediately, without waiting on the centralized data engineering team to act on a support ticket.

Best practices for ETL and ELT pipelines

For the most part, then, a key best practice for data pipelines is: use ELT. There are, however, a few exceptions where ETL still makes sense:

  • You’re dealing with a small dataset that requires complex transformations
  • You have to integrate with a legacy or third-party system that works better with an ETL approach
  • You’re handling scenarios such as Internet of Things (IoT) data, where you need to combine data arriving in different formats, or de-duplicate entries, before storing it

Whether you use ETL or ELT, however, there’s a set of best practices you can follow to improve the quality, reliability, and performance of your data transformations.

Have a workflow process in place

Many data pipelines—even today—are driven primarily by engineering and managed on an ad hoc basis. There are few processes in place to establish SLAs, version pipelines, and respond to data pipeline issues. That leads to:

  • Data that diverges from business requirements
  • Untested (or undertested) data changes pushed to production
  • Broken pipelines and outdated reports that lead to impaired trust in data
  • Slow decision-making velocity

We originally created dbt to address what we saw as deficiencies in the data workflow process. We sought to bring some of the best practices from software engineering (version control, deployment pipelines, testing, and documentation) into the data world.

Today, we’ve expanded that mission by championing what we call the Analytics Development Lifecycle (ADLC). Like its counterpart, the Software Development Lifecycle (SDLC), the ADLC aims to break down the barriers between data team members—data engineers, analysts, and business decision-makers. It ensures each new data pipeline or pipeline improvement aligns with core business objectives and has measurable key performance indicators (KPIs).

From a quality standpoint, the ADLC focuses on creating small, well-tested changes that engineers promote to production with high velocity. Once changes are in production, they’re monitored, operationalized for performance, and made available for analysts and business users to discover in a self-service manner.

Use version control and enable collaboration

A major source of errors in ETL and ELT pipelines is the lack of a central repository for a data pipeline’s source code. At a minimum, all pipelines should be tracked and versioned using a version control system. That guarantees that all data changes are represented as code and can be discovered, changed, or even reverted as needed.

Besides tracking changes, version control enables collaboration among data engineers, analytics engineers, and other technically savvy data users. It provides a single source of truth that people can discover, modify, and submit for approval. That allows more people across the company to contribute to building and maintaining data pipelines.

Build a culture of quality around data workflows

According to Monte Carlo, half of all companies surveyed said that 25% of their revenue is impacted by poor data quality. That speaks to a need to build quality controls and measures into every part of the ADLC.

Part of building a culture of quality means that engineers should be creating tests for all new data pipelines and for any changes they implement. Equally important, however, is implementing a version control-based process for enforcing quality checkpoints:

  • Engineers should work in their own source control branches, separate from the “main” branch containing production code
  • After testing their changes locally, engineers create a pull request (PR) to push code into the main branch
  • The PR triggers a manual review of their code, as well as an automatic run of any tests they’ve written
  • Once approved, a Continuous Integration/Continuous Deployment (CI/CD) process further tests and deploys their changes automatically to production
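In dbt, for example, one lightweight way to codify such a quality check is a singular test: a SQL file in the tests/ directory that passes when it returns zero rows. A sketch, using a hypothetical orders model and amount column:

    -- tests/assert_no_negative_order_amounts.sql
    -- Fails the build if any order has a negative amount.
    select
        order_id,
        amount_usd
    from {{ ref('orders') }}
    where amount_usd < 0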

Keep separate environments for each part of the dev process

No one should be developing changes against production data. At a minimum, data pipeline development should use three separate environments:

  • One or more dev environments that engineers can test their changes against locally
  • A staging environment where the CI/CD pipeline runs all tests before deploying to production
  • The production environment, against which only fully tested and approved changes run
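In dbt, one common way to keep these environments apart is to route builds into separate schemas based on the active target. The sketch below overrides dbt’s generate_schema_name macro and assumes target names of prod, ci, and dev defined in profiles.yml:

    -- macros/generate_schema_name.sql
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- set base = custom_schema_name if custom_schema_name is not none else target.schema -%}
        {%- if target.name == 'prod' -%}
            {{ base | trim }}
        {%- elif target.name == 'ci' -%}
            ci_{{ base | trim }}
        {%- else -%}
            {# dev builds land in the developer's personal schema from profiles.yml #}
            {{ target.schema | trim }}
        {%- endif -%}
    {%- endmacro %}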

Develop and maintain a style guide

Part of enabling collaboration on data pipelines means creating code that’s readable and understandable by others. A style guide brings clarity and consistency to all data pipeline code by laying out a set of conventions that all engineers should follow.
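What that looks like in practice varies by team, but here’s an illustrative dbt model written the way many SQL style guides suggest: lowercase keywords, one column per line, and “import” CTEs at the top. The model and column names are hypothetical:

    with

    orders as (
        select * from {{ ref('stg_orders') }}
    ),

    customers as (
        select * from {{ ref('stg_customers') }}
    ),

    final as (
        select
            orders.order_id,
            orders.order_date,
            customers.customer_name,
            orders.amount_usd
        from orders
        left join customers
            on orders.customer_id = customers.customer_id
    )

    select * from final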

Document all workflows

Most data pipelines go undocumented, which makes it harder for others to understand:

  • How to modify them
  • How to use the resulting data
  • Where the data was derived from and how it was calculated

Documentation enables pipeline maintainers to understand how and why certain decisions in the code were made. It also makes it easier for data consumers to have confidence in the resulting datasets, which increases overall data trust.

Monitor data pipelines in production

The Operate and Observe phase of the ADLC acknowledges that a data pipeline isn’t “done” just because the code has shipped. Monitoring ensures that new data pipeline code is running smoothly: it tests in production, reports errors, and recovers gracefully from failures. It also verifies that the code meets defined performance metrics and answers users’ requests in a timely fashion.
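A simple example of this kind of production check, expressed as a dbt singular test: fail the run if no new rows have landed recently. The model name, column, and 24-hour threshold are assumptions for illustration; dbt’s built-in source freshness checks cover similar ground declaratively:

    -- tests/assert_orders_loaded_recently.sql
    -- Returns a row (and therefore fails) only if the newest record is stale.
    select
        max(loaded_at) as latest_load
    from {{ ref('stg_orders') }}
    having max(loaded_at) < current_timestamp - interval '24 hours'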

Optimize your pipelines based on your platform

All of the above advice applies to any approach to crafting ETL and ELT pipelines, regardless of the language or tools you use. Your toolset will usually support additional features that make developing and shipping revisions to data transformations faster.

In dbt, for example, you create new data transformations by creating data models. Within dbt, you can optimize your pipelines for better performance and lower cost in a number of ways. One of the most common is materializing large models incrementally, so that scheduled runs process only new or changed records instead of rebuilding entire tables from scratch.
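Here’s a minimal sketch of that incremental pattern, with hypothetical model and column names:

    -- models/fct_events.sql
    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        user_id,
        event_type,
        event_timestamp
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
      -- applied only on incremental runs, not on full refreshes
      where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}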

Enabling collaboration on ETL and ELT pipelines at scale

Building out the infrastructure to support these processes for ETL and ELT pipelines from the ground up requires significant engineering investment. This is made even more challenging by the heterogeneous nature of most corporate data environments, which may be using hundreds of different data sources.

dbt Cloud can serve as your company’s data control plane for managing ETL and ELT pipelines across the enterprise:

  • It’s natively interoperable across various cloud and data platforms, so you’re never locked in
  • Its platform features support data developers and their stakeholders across every stage of the Analytics Development Lifecycle, making data analytics a team sport
  • It provides the trust signals and observability features required to ensure all data outputs are accurate, governed, and trustworthy

To learn more about managing data pipelines at scale with dbt Cloud, ask us for a demo today.
