
The Analytics Development Lifecycle: Deploy

For too long, the data world has shipped analytics code changes to production in an ad hoc manner. The result is often data errors, downtime for users, and wasted engineering effort.

It’s time we got smart about shipping pipeline changes to production. Using techniques borrowed from software engineering, we can take an approach to deploying changes that reduces human error and builds in numerous quality checks throughout the process.

In our latest series of articles, we’ve been covering how you can use the different stages of what we call the Analytics Development Lifecycle (ADLC) to ship high-quality analytics code. In this installment, we’ll look at the principles behind an automated approach to deploying data changes from developers’ machines into data consumers’ hands.

The basics of CI/CD

If you follow software engineering best practices, you likely already know the basic tenets of Continuous Integration and Continuous Deployment, or CI/CD. For those who don’t, here’s a quick refresher:

Continuous Integration builds and tests changes to code as soon as they’re checked into a designated branch of a source code version control system. When developers think they’re ready to ship their changes, they merge them from the branch they’ve been working on into one that triggers a series of tests and sanity checks against their work.

Continuous Deployment pushes these changes automatically to production. To do this, the process first pushes them to pre-production environments, where another series of tests are run against testbed data. It then runs any migration procedures required to make the changes live, using telemetry to ensure everything’s operating within expected parameters.

CI/CD builds in numerous safeguards to the deployment process, such as:

  • Submitting all changes to review by another team member before going live
  • Testing changes in multiple environments before production release
  • Rolling back a change quickly and automatically if a deployment exhibits problematic behavior
  • Scoping changes to the smallest possible unit of release to limit the scope of potential errors

It turns out that these processes from software engineering map nicely to analytics code systems like dbt. dbt’s models capture all analytics code changes in text as SQL or Python code. This means we can capture all changes in source control, validate them with testing, and employ automatic processes to push vetted changes to production.
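Because analytics code lives as text, a CI job can validate it mechanically before it is ever merged. Here is a minimal illustrative sketch of one such check - every model must have at least one declared test - where the model names and the `tests` mapping are hypothetical stand-ins, not dbt's actual project schema:

```python
# Illustrative CI-style check over analytics code captured as text.
# Model names and the `tests` mapping are hypothetical, not a real dbt API.

def check_models_have_tests(models, tests):
    """Return the models that lack tests, i.e. those that should block a merge."""
    return sorted(m for m in models if not tests.get(m))

models = ["stg_orders", "stg_customers", "fct_revenue"]
tests = {
    "stg_orders": ["not_null_order_id"],
    "stg_customers": ["unique_customer_id"],
    # fct_revenue has no tests declared, so CI should flag it
}

untested = check_models_have_tests(models, tests)
if untested:
    print(f"CI check failed: models without tests: {untested}")
```

In a real pipeline, a check like this would run automatically on every merge attempt, turning a team convention into an enforced quality gate.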

Principles of the ADLC Deploy phase

The exact steps of your Deploy phase will reflect the specific needs of your company. However, the following are universal attributes of any mature analytics code deployment process:

  • Managed in multiple environments
  • Size of change chosen by developers
  • Triggered by source code merge
  • Automated deployment
  • Downtime-free
  • Automated rollbacks

Managed in multiple environments

Pushing changes directly to production is a surefire way to break existing data workflows and applications. That's why, in the software world, dev teams deploy their changes through multiple environments, such as staging and testing, before pushing them live to end users.

We can emulate this best practice in the data world by creating multiple environments - at a minimum, a pre-production staging environment - to test our changes before pushing them live. The test environments contain mock data that emulates our real-world data as closely as possible.

Using this approach, data engineers can expand the scope and availability of their changes in a safe, controlled manner:

  • While developing changes, data engineers can work on their own dev boxes/environments and in isolated source code branches, ensuring their changes don't interfere with those that others are making
  • When ready to test, they can ship their changes to a shared test environment that may also contain other pending changes
  • Only after changes are thoroughly tested and vetted will engineers then make them available for decision-makers and other data stakeholders

Setting up isolated environments with enough realistic mock data to make testing worthwhile takes some upfront investment. However, this approach pays dividends in the long run: defects are far more expensive to fix in production than earlier in the development lifecycle, so the earlier in the ADLC you catch an error, the cheaper it is to fix.
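One simple way to keep environments isolated is to route each one to its own warehouse schema, with a personal sandbox per engineer in dev. The environment names and schema conventions below are assumptions for illustration, not a dbt feature:

```python
# Illustrative sketch: resolving an isolated warehouse schema per environment.
# Environment names and schema prefixes here are assumptions, not a dbt API.

ENVIRONMENTS = {
    "dev": {"schema_prefix": "dbt_dev", "data": "mock"},
    "staging": {"schema_prefix": "dbt_staging", "data": "mock"},
    "prod": {"schema_prefix": "analytics", "data": "live"},
}

def target_schema(env, developer=None):
    """Pick a schema so work in one environment can't clobber another."""
    cfg = ENVIRONMENTS[env]
    if env == "dev" and developer:
        # Each engineer builds into a personal sandbox schema.
        return f"{cfg['schema_prefix']}_{developer}"
    return cfg["schema_prefix"]
```

With a convention like this, two engineers testing conflicting changes simply build into different schemas, and only the prod environment ever touches live data.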

Size of change chosen by developers

One hallmark of an immature analytics workflow is that teams typically ship large changes - dozens of new tables or updates to existing models at once. The problem is that the more you change, the more likely you are to introduce errors.

By scoping their changes to smaller units - such as adding only a few tables or changing a couple of fields - data engineers create change sets that are more tightly scoped and easier to test. Instead of pushing out a huge batch of risky changes simultaneously, they can use the quality gates built into the ADLC - code reviews, automated testing, pre-production deployments, etc. - to vet a smaller set of changes in isolation.
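Keeping a change set small also makes it cheap to compute exactly what changed since the last deploy. dbt supports state-based selection for this; the dict diff below is just a language-agnostic analogy, with hypothetical model names:

```python
# Illustrative sketch: deriving a tightly scoped change set by comparing
# current model definitions against the last deployed state. (dbt offers
# state-based selection for this; the dict diff below is only an analogy.)

def modified_models(deployed, current):
    """Models that are new, or whose definition changed, since the last deploy."""
    return sorted(
        name for name, sql in current.items()
        if deployed.get(name) != sql
    )

deployed = {
    "stg_orders": "select * from raw.orders",
    "fct_revenue": "select order_id, amount from raw.orders",
}
current = {
    "stg_orders": "select * from raw.orders",                    # unchanged
    "fct_revenue": "select order_id, amount, tax from raw.orders",  # changed
    "dim_customers": "select * from raw.customers",              # new
}
```

Testing only the two modified models, rather than the whole project, keeps each release small enough to review and vet in isolation.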

Triggered by source code merge

A key benefit of source code control is that data engineers can work in their own branch or repo fork, isolating their changes from other changes in progress. If their changes aren't ready to promote to production, they can checkpoint them to their own branch, safe in the knowledge that they won't break anything in production or anything else that might be in flight.

Once they’re ready to promote their code, they can create a pull request (PR) to merge their branch into the main production branch. After successfully passing code review, the release process commences, triggering the next stage - an automated deployment.

Automated deployment

An automatic deployment immediately vets a set of changes for release to production. This increases the velocity of analytics code deployments, reducing the time and cost involved in deploying.

Automation ensures that the same steps are taken to verify and release every deployment. It removes human variability from the equation, reducing human error and improving the repeatability of the release process.

Once a PR is approved, the changes run against a pre-production environment, such as staging. If any tests or other integrity checks fail, the change is sent back to the data engineers for resolution. Once the change is verified, the release process runs any and all required tooling - such as data migrations - to make the change live for all data consumers.

Building an automated deployment process for analytics code changes can take time to develop and perfect. Platforms such as dbt Cloud that support CI/CD processes out of the box can greatly reduce the overhead of building custom deployment pipelines for your data plane.
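The core of the release logic above is simple: run the same pre-production checks on every change, and promote only when all of them pass. Here is a hedged sketch of that gate - the check names and callbacks are illustrative stand-ins, not a real deployment tool's API:

```python
# Illustrative sketch of an automated release gate: every change passes the
# same staging checks before promotion, and any failure stops the release.
# Check names and callbacks are assumptions for illustration only.

def deploy(change, run_checks, promote):
    """Run pre-production checks; promote only if every check passes."""
    failures = [name for name, ok in run_checks(change).items() if not ok]
    if failures:
        return {"status": "rejected", "failed_checks": failures}
    promote(change)
    return {"status": "released", "failed_checks": []}

def passing_checks(change):
    # Stand-in for staging builds, tests, and data integrity checks.
    return {"build": True, "tests": True, "row_count_sanity": True}

result = deploy({"models": ["fct_revenue"]}, passing_checks, lambda change: None)
```

Because the gate is code, it runs identically for every change - the repeatability that removes human variability from the release process.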

Downtime-free

These additional processes may sound to some like unnecessary overhead. However, the goal of this process is to eliminate the far costlier overhead of production downtime.

In an immature analytics workflow, teams may frequently push out changes that haven't been properly vetted. This can result in pushing changes with basic errors - such as null values in required fields - that cause reports to break or data pipelines to cease functioning.

To be sure, your company wastes time and dollars any time data engineers must scramble to put Humpty Dumpty back together again. You may lose just as much or more time and money, however, through delayed business decision-making - or, even worse, inaccurate business decisions based on bad data.

A mature analytics workflow builds repeatability and quality control into the deployment process, incorporating lessons learned from past deployments to detect errors before they impact users.

Automated rollbacks

Pre-production testing is a great way to identify and resolve foreseeable data errors. Production systems are complex, though. While the goal is to eliminate all errors in production, it's impossible to predict every edge case.

When this happens, it's critical to have an automated rollback mechanism for changes. Teams should identify a subset of their test bed as smoke tests that get run on production data on a regular basis. If the system detects an error, this should trigger alerts and notifications. Data engineers can then revert their changes in production while they identify the root cause.
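The alert-and-revert loop above can be sketched as a small decision function. The smoke-test results here are stand-ins - in practice they would come from queries against live data - and the callback names are hypothetical:

```python
# Illustrative sketch: evaluate scheduled smoke tests against production and
# decide whether to alert and roll back. Test results are stand-ins; in
# practice they would come from queries against live data.

def evaluate_smoke_tests(results, alert, rollback):
    """Alert on any failure and request a rollback of the latest change."""
    failed = [name for name, ok in results.items() if not ok]
    if failed:
        alert(f"smoke tests failed: {failed}")
        rollback()
    return failed

# Example run: one passing check, one failing check.
outcome = evaluate_smoke_tests(
    {"orders_not_null": True, "revenue_reconciles": False},
    alert=print,
    rollback=lambda: print("reverting latest release"),
)
```

Automating this loop means the system, not an on-call engineer, takes the first corrective step, and the team investigates the root cause with production already back in a known-good state.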

What’s next

Pushing your changes to production doesn't mean you're done with the ADLC. In the next installment of our series, we’ll move to the operations phase of the process, which is critical to ensuring your changes run smoothly.

Last modified on: Jan 28, 2025
