Staging Models Best Practices and Limiting View Runs
Aug 15, 2023
LearnWe’ve received questions during dbt training sessions and in the dbt Slack community about staging models and materializing views. In this post, I want to help answer these questions.
More specifically, we commonly get asked:
“Should I get rid of staging models to reduce successful models built counts?”
“How can I exclude unchanged views in my dbt Cloud job runs to optimize them?”
For this post, I want to reiterate our dbt project structure recommendations and provide a solution to optimize your cloud data platform spend and shorten job times.
Structuring projects
One of our most read guides is titled, “How we structure our dbt projects”. In the guide we advise structuring your dbt project with “three primary layers in the models directory, which build on each other.”
- Staging — creating our atoms, our initial modular building blocks, from source data
- Intermediate — stacking layers of logic with clear and specific purposes to prepare our staging models to join into the entities we want
- Marts — bringing together our modular pieces into a wide, rich vision of the entities our organization cares about
We continue to encourage dbt users to follow this structure, especially establishing a staging layer mainly composed of views, not tables. Why?
Think about it this way - the first time you need to change one thing about data from an un-staged data source that permeates through your project, you either have to make that change in multiple downstream models or invest up front in building and maintaining a staging layer. A staging layer is an investment, one that can have direct and indirect impacts on the cost of your data.
We have seen across our customers how staging models help reduce code duplication, storage, and code drift. You can read more about our staging model recommendations in our documentation.
Excluding views in a dbt Cloud job
With the first question answered, what if you want to exclude models configured as views from your job runs?
To omit running views you have several options that depend on your requirements.
To simply exclude running views for a dbt Cloud job:
- Go to your current production deployment job in dbt Cloud
- Modify your command to include:
--exclude config.materialized:view
- Save your job changes
When excluding views in your run jobs, you need to consider a couple things before following this:
- If you have views that include macros that dynamically generate code (i.e. case statements) based on upstream tables, these will need to be run each time to account for new values.
- If you need to still test your views with each run, you’ll need to check out the next tip. Using the above command will exclude views from being materialized and from running tests.
Excluding views while still running tests
To allow you to exclude views from your job run while still running tests, you can follow these steps to create a custom selector for your job command. Hat tip to dbt Community member, Nicholas Yager, for this suggestion.
- Open your dbt project in the dbt Cloud IDE
- In your project top level folder, add a file called
selectors.yml
- In the file put the following code:
selectors:
- name: skip_views_but_test_views
description: >
A default selector that will exclude materializing views
without skipping tests on views.
default: true
definition:
union:
- union:
- method: path
value: "*"
- exclude:
- method: config.materialized
value: view
- method: resource_type
value: test
- Save the file and commit it to your project
- Modify your dbt Cloud jobs to include
--selector skip_views_but_test_views
This custom selector will skip running views in your project, but will run tests for views. Running tests for views each time can help keep data quality intact and save you from the need to rerun failed jobs.
Building only changed views
Let’s take this one step further, what if you want to exclude models configured as views that did not change since the last job run?
Let’s create a merge job that gets triggered when code gets merged into main:
- Create a new dbt Cloud job, call it “Merge Job”
- Set the environment to your default production environment
- Set Commands to:
dbt run -s state:modified+
- Note: we’re not using dbt build because we already used Slim CI to both run and test our code that just got merged into main, so it’s unnecessary to execute a build command in this context.
- Note: we’re not using dbt build because we already used Slim CI to both run and test our code that just got merged into main, so it’s unnecessary to execute a build command in this context.
- Under the execution settings, select the default production job to compare changes against:
- Defer to a previous run state — Select the “Merge Job” you created so the job compares and identifies what has changed since the last merge.
- Defer to a previous run state — Select the “Merge Job” you created so the job compares and identifies what has changed since the last merge.
- In your dbt project, follow the steps in this guide to create a script to trigger the dbt Cloud API to run your job after a merge happens within your git-repository or watch this video
The purpose of the merge job is to:
- Immediately deploy any changes from PRs to production
- Ensure your production views remain up-to-date with how they're defined in your codebase while still ensuring you can remain cost-efficient when running jobs in production.
The merge action will optimize your cloud data platform spend and shorten job times, but you’ll need to decide if making the change is right for your dbt project.
This is a bit of work to set up and test, but the benefit is worth it since we see 20-40% of customers’ runs are views.
Keep asking
All of us at dbt Labs are listening to the dbt Community feedback. We’re doing our best to make decisions that align with our values and building the future that we believe in for dbt and dbt practitioners.
Reach out to any of us in the dbt Community Slack if you have questions about how to implement in your dbt Cloud environment.
Last modified on: Oct 15, 2024
Set your organization up for success. Read the business case guide to accelerate time to value with dbt Cloud.