
Why analytics engineering and DevOps go hand-in-hand

Dec 04, 2024


Scaling your data estate can be daunting. Organizations with many projects might find themselves in a logistics nightmare trying to maintain consistency across hundreds of data projects manually.

It doesn't have to be that way. By allowing analytics engineers to leverage patterns in existing tools and embrace concepts from DevOps, you can streamline scaling your data estate on dbt Cloud while avoiding too much pressure on your DevOps engineers.

Duet Technologies, an early-stage company that’s created the first provider network for Nurse Practitioner-owned practices, found they could scale more easily by bringing their analytics engineers into the DevOps world. Let’s look at how they did it.

From analytics engineering to infrastructure

Duet Technologies aims to support Nurse Practitioners (NPs) in transforming and easing access to primary care—helping people, especially in underprivileged and under-resourced areas, find NPs who can provide them with care, while also tackling the challenges that NPs encounter in running their practices.

The founding Analytics Engineer at Duet, Katie Clairborne, is an analyst turned engineer. She started her career as a business analyst working in spreadsheets before she moved on to data analysis using tools like Tableau.

From there, she was introduced to the world of engineering via the version control available in dbt Cloud. Not stopping there, she continued to learn about DevOps and CI/CD using GitHub Actions while also becoming familiar with Infrastructure as Code with tools like Terraform.

Katie believes that tools like dbt Cloud’s Visual Code Editor can be very effective for learning engineering topics such as the building blocks of SQL. Further, she argues that analytics engineering isn’t just an entry point into the world of software engineering. It can even be a jumping-off point for further specialization into areas like infrastructure engineering and DevOps.

Analytics engineers may not realize it, but they already have the skills they need to solve their infrastructure problems, such as scaling their dbt Cloud deployments.

Framing the problem

When you hear phrases like “multi-project deployment” or “multi-project collaboration,” dbt Mesh might be the first thing to come to mind. However, multi-project collaboration requires multiple sets of cloud resource configurations. And those configurations need to be consistent.

Consider a single dbt Cloud project. This project will have at minimum two environments—one for development and one for production.

In practice, however, there may likely be other intermediate environments, such as a staging environment. Each of these environments will also have one or more jobs.

There will probably be a CI job that runs as part of a pull request. However, there may also be a merge job that runs when code changes are merged into main. There could even be a production deployment job that's triggered via an API call when a GitHub release is published.

Setting up just a single project requires a significant number of steps for each environment:

  • Naming the project
  • Configuring the directory
  • Setting up external connections such as BigQuery
  • Setting up service accounts or other credentials on the external service side
  • Getting the authorization set up correctly on those services
  • Connecting a git repository

Each of these is also unlikely to be a single action. Each may have its own number of small steps. When Katie did this, she counted 225 steps over about 15 minutes to configure a single dbt Cloud project manually. (And she was already very familiar with dbt Cloud.)

Put into the perspective of an organization that could have five, ten, fifteen, or even more projects, that's a lot of manual steps and plenty of chances to make mistakes. The likelihood that an individual or team can consistently execute all of them correctly for every single project isn't high.

Solving the problem via people, processes, and technology

This raises the question of how we can set up and maintain these projects in a more automated, sustainable fashion. To accomplish this, Katie relied on the three pillars of DevOps: people, processes, and technology.

People

Katie believes that organizations should give analytics engineers the opportunity to extend concepts they’ve learned from dbt. Integrating analytics engineers into a DevOps function, she says, is the most efficient use of an organization’s personnel.

Those analytics engineers may not be able to solve all of those problems independently. However, they also don’t need to hire or borrow a DevOps engineer to support their analytics engineering efforts.

Analytics engineers bring an innate understanding of the problem they’re attempting to solve. Their skill sets transfer well to infrastructure, even if the specific tooling may be new to them.

A DevOps team provides an organization-level framework within which analytics engineers can contribute. The team can help them with guidance, best practices, and standardized workflows.

Processes

Optimization at the right time is crucial.

  • Optimize too early and you risk wasting time and effort building features you may end up not using.
  • On the other hand, not optimizing at all has a different set of downsides: teams may get better at executing all of these steps manually, but they’re simply deferring high maintenance costs.

Katie recommends applying the rule of three and optimizing around the time of the third project. She suggests this is when teams have a decent idea of what they will benefit from. Thus, they can spend the extra time optimizing to gain dramatic improvements that’ll pay dividends in the future.

Tools

Lastly, Katie encourages using tools like Terraform to manage cloud resources programmatically.

A scalable data project system needs to reduce both project setup time and maintenance costs. To that end, we ideally want to manage dbt Cloud resources programmatically.

The dbt Cloud Terraform provider is exactly the tool for this. It was originally developed by community member Gary James and is now an official piece of software maintained by dbt Labs.

A Terraform provider is a Terraform plugin, like adapters in dbt. Just as you might use a Snowflake or BigQuery adapter in dbt, you can use a Google Cloud or dbt Cloud provider in Terraform.

Using Terraform to transform dbt Cloud transformations

Both dbt and Terraform need to be told how to connect to data platforms or cloud providers. In dbt, this varies depending on the product and environment being used.

dbt Cloud has account-level connections for service accounts and personal credentials for developers. If developers are using the dbt Cloud CLI, they might choose to download those credentials. This is very similar to dbt Core's profiles.yml file.

In Terraform, these connections are configured using a provider declaration block that includes an account ID, a dbt Cloud service token, and a dbt Cloud host URL:

provider "dbtcloud" {
  account_id = var.dbtcloud_account_id
  token      = var.dbtcloud_token_id
  host_url   = var.dbtcloud_host_url
}
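
For completeness, the var.* references above would typically be declared as Terraform input variables. Here's a minimal sketch (the variable names simply mirror the provider block, and marking the token as sensitive keeps it out of plan output):

variable "dbtcloud_account_id" {
  type        = number
  description = "Numeric ID of the dbt Cloud account"
}

variable "dbtcloud_token_id" {
  type        = string
  description = "dbt Cloud service token used by Terraform"
  sensitive   = true
}

variable "dbtcloud_host_url" {
  type        = string
  description = "dbt Cloud API host URL for your region or instance"
}

The actual values can then be supplied via a .tfvars file or environment variables rather than being committed to the repository.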

Once you've set up the provider, you need to set up your code dependencies. Just as before, dbt and Terraform handle this in similar ways.

In dbt, you use the dbt deps command to install packages. Likewise, Terraform provides a terraform init command that installs providers and modules. If you look at the file structure of both tools, they’re very similar: a subfolder for installed dependency code, a lock file for consistency, and a configuration file.
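
As an illustration, the provider that terraform init downloads is declared in a required_providers block. This is a minimal sketch, assuming the provider is published under the dbt-labs/dbtcloud registry namespace:

terraform {
  required_providers {
    dbtcloud = {
      source  = "dbt-labs/dbtcloud"
      version = "~> 0.3" # illustrative constraint; pin the release you actually use
    }
  }
}

Running terraform init then downloads the provider into the .terraform subfolder and records its exact version in the lock file, mirroring what dbt deps does for packages.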

When defining resources in Terraform, you declare data sources as inputs—just as you'd use sources in dbt to create models. And much as you can preview a model's SQL in dbt before running it, you can use Terraform's plan command to see the infrastructure changes that will occur when you run the apply command.
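
For example, a data source lets Terraform read an existing resource as a read-only input rather than managing it. The sketch below assumes the dbt Cloud provider exposes a dbtcloud_project data source that can be looked up by name; check the provider docs for the exact schema:

# Read an existing dbt Cloud project without managing it
data "dbtcloud_project" "analytics" {
  name = "Analytics"
}

# Reference it elsewhere, much like selecting from a source in dbt
output "analytics_project_id" {
  value = data.dbtcloud_project.analytics.id
}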

Likewise, both dbt and Terraform produce artifacts in the form of files that describe their state at a moment in time: the manifest.json file and the Terraform state file, respectively. Each lists all of the resources along with various attributes about them. Both tools also have commands for building dependency graphs, comparing state, and executing over only changed elements.

How this helps with scaling dbt Cloud

If we go back to our description of environments and jobs within a dbt Cloud project, we could have one Terraform file for environments, one for jobs, and one for project-level resources.
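
As a rough sketch of what the environment and job files might contain, the snippet below uses the dbt Cloud provider's resource types; treat the exact attribute names and values as assumptions and confirm them against the provider documentation for your version:

# environments.tf: a deployment environment for CI and staging runs
resource "dbtcloud_environment" "staging" {
  project_id  = dbtcloud_project.this.id # defined in the project-level file
  name        = "Staging"
  type        = "deployment"
  dbt_version = "latest" # illustrative; use a version your account supports
}

# jobs.tf: a CI job that builds only modified models on pull requests
resource "dbtcloud_job" "ci" {
  project_id     = dbtcloud_project.this.id
  environment_id = dbtcloud_environment.staging.environment_id
  name           = "CI"
  execute_steps  = ["dbt build --select state:modified+"]
  triggers = {
    github_webhook       = true
    git_provider_webhook = false
    schedule             = false
  }
}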

However, if we've got a large number of dbt Cloud projects, each with its own code repository, how should this be handled?

While you could copy these files into each repository, consider what would happen if you needed to make a broad-scale change to the way these projects are configured across the organization. You would need to go into each repository, make the same changes over and over, and then validate that each was made correctly.

In dbt, you can use macros with arguments to avoid repetition. You can do effectively the same thing with modules in Terraform. Modules provide a way to create multiple related resources together. Take the example below:

module "dbtcloud_project" {
  source       = "./modules/dbtcloud_project"
  project_name = "coalesce"
}

The specified source module defines all the resources a dbt Cloud project needs to get started. The project name is provided as a variable. Each team can then call this module in their repository. This creates a pre-configured general structure that the entire organization can use while retaining the flexibility to configure a subset of elements differently if needed.
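
To make that concrete, here's a minimal sketch of what modules/dbtcloud_project might contain. The file layout and attribute names are assumptions for illustration; a real module would also define environments, jobs, connections, and repository links:

# modules/dbtcloud_project/variables.tf
variable "project_name" {
  type        = string
  description = "Name of the dbt Cloud project to create"
}

# modules/dbtcloud_project/main.tf
resource "dbtcloud_project" "this" {
  name = var.project_name
}

# modules/dbtcloud_project/outputs.tf
output "project_id" {
  description = "ID of the new project, used to wire up environments and jobs"
  value       = dbtcloud_project.this.id
}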

When Katie did this, the manual procedure that previously took 15 minutes was completed in just under a minute.

Conclusion

This isn't even everything that's available for DevOps automation with dbt Cloud. There are also tools like dbt-jobs-as-code, which uses YAML files to define dbt Cloud jobs, and dbt Cloud Terraforming, which generates Terraform configuration files from existing dbt Cloud configurations, making it easier to incorporate Terraform.

By treating dbt Cloud infrastructure as code, we simplify both project setup and project maintenance. Creating an easy way to tear down, adjust, and recreate those projects can be incredibly valuable. Additionally, you can transparently maintain consistency across projects.

Curious to see how you can scale your data operations with dbt Cloud and DevOps? Watch the full presentation to learn more.
