
How to track data changes with dbt snapshots

You may have come across "Type 2 Slowly Changing Dimensions" or this idea of "change tracking" within your data warehouse. Regardless of the nomenclature, all data analysts have faced the same problem -- how do I model data when the records in the data warehouse keep changing? In this post, I'll cover how to solve this using dbt snapshots.

A tale of two data types

Source data tables generally come in two different flavors:

  • Mutable: Records are updated in-place over time. A typical example is an orders table, where the status column changes as the order is processed.
  • Immutable: Once a record is created, it is never updated again. A typical example is clickstream data, like a page_viewed or link_clicked event. Once that event is recorded, it won't be updated again.

Applications often store data in mutable tables. The engineers that design these applications typically want to read and modify the current state of a row -- recording and managing the historical values for every row in a database is extra work that costs brain power and CPU cycles.

For analysts, however, mutable data can be difficult to work with. It's best explained with an example.

Imagine you have an orders table where the status field can be overwritten as the order is processed. One day you query the table and get back this result:

order_id   status    created_at   updated_at
1          pending   2019-01-01   2019-01-01

A day later, that order goes from "pending" to "shipped". Now, your query gives you back the following:

order_id   status    created_at   updated_at
1          shipped   2019-01-01   2019-01-02

Since this record was mutated, the information about the order in the pending state has been overwritten. Straight away, you've lost the ability to answer questions like "when did that order ship?", and "how long did it take to change status?".

If you've ever had a KPI for last week change out from under you, or if a member of the finance team tells you that the values they exported to Excel no longer match what your BI tool is saying, you're probably experiencing the pain of mutable data. Working with mutable data can make you feel like you're building your analytics on top of quicksand.

You realize that all your problems would be solved if there were a way to turn your order data into immutable records, tracking the different values in your table over time. You ask your API team, "can we build an order_history table in the backend?", but get told you'll have to wait months before they can look into it.

Fortunately, this is where dbt's snapshots come in -- they make it possible to generate a stream of record changes from a source table, effectively turning mutable data into immutable data. As an analyst using dbt, you can add snapshots to your dbt project in an afternoon, without the need to wait on external engineering resources from your core product team.

How do dbt snapshots work?

Snapshots are simple select statements which define a dataset. Every time you run the dbt snapshot command, dbt will run your select statement and check whether the dataset has changed compared to its last known state. If a record has changed, then dbt will create a new entry in the snapshot table for the new state of that record. If there are net-new records in the source dataset, then dbt will create an initial entry for each.

For the orders example above, the resulting snapshot table looks like this:

order_id   status    updated_at   dbt_valid_from   dbt_valid_to
1          pending   2019-01-01   2019-01-01       2019-01-02
1          shipped   2019-01-02   2019-01-02       null
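To create a snapshot, you wrap a select statement in a snapshot block and add a bit of configuration. Here's a minimal sketch, assuming a source named jaffle_shop with an orders table -- the file name, target schema, and column names are all illustrative:

    -- snapshots/orders_snapshot.sql (hypothetical file)
    {% snapshot orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='timestamp',
          updated_at='updated_at',
        )
    }}

    select * from {{ source('jaffle_shop', 'orders') }}

    {% endsnapshot %}

The timestamp strategy configured here, and its alternative, are covered in more detail below.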

Behind the scenes, dbt runs all the appropriate DDL and DML to insert, update, and merge records into your snapshot table. If new columns appear in your source query, then dbt will add these columns to your snapshot table.
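The generated SQL is adapter-specific and handled for you, but conceptually the operation looks something like this sketch (the snapshot table and staging relation names are assumptions):

    -- 1. Invalidate the current version of each record that has changed
    update snapshots.orders_snapshot as snap
    set dbt_valid_to = changes.updated_at
    from changed_rows as changes  -- staged rows whose values differ from the snapshot
    where snap.order_id = changes.order_id
      and snap.dbt_valid_to is null;

    -- 2. Insert a new version for each changed or net-new record
    insert into snapshots.orders_snapshot
        (order_id, status, updated_at, dbt_valid_from, dbt_valid_to)
    select order_id, status, updated_at, updated_at, null
    from changed_rows;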

The particular mechanism for tracking changes that dbt uses is often referred to as "Type 2 Slowly Changing Dimensions".

What the heck is a Type 2 Slowly Changing Dimension?

If you're already familiar with the term, jump ahead! If the term is new to you, or if you've always wondered what it meant, let's work through it step by step:

  • "Dimension", in Kimball parlance, is a data model where a single record represents a person, place, or thing -- customers, products, locations, etc. This is in contrast to a "fact" which represents a process -- orders, web sessions, transactions, etc.
  • Dimensions are often considered "Slowly Changing" because their details update unpredictably and over a long period of time -- a customer changes their address, a product is renamed, or an agent is assigned to a new team. Facts, on the other hand, tend to have a point in time at which they are no longer updated. Once an order is received, the order won't be updated again; once a user leaves her computer, the web session is finished.
  • "Type 2" because there are a number of ways to handle mutable data, and the approach that dbt uses (i.e. inserting changed records, with valid_from and valid_to metadata columns) has been arbitrarily defined as the second type -- you can check out the other ways of handling mutable data here.

Tracking data changes

dbt ships with two different strategies for determining if rows have changed in a table. In both cases, dbt relies on a provided unique_key to match rows in your snapshot query to rows in your snapshot table. Each strategy is examined in further detail below.

Timestamp (preferred)

The timestamp strategy uses an updated_at field to determine if a row has changed. When you run the dbt snapshot command, dbt checks whether the updated_at column for a row is more recent than the last time the snapshot ran. If it is, then dbt will invalidate the old record and insert the new one. If the timestamps are unchanged, then dbt will not take any action, as the row has presumably not changed.

Check

Some data sources do not include a reliable timestamp that indicates when a record has changed. If this is the case for your data, you can instead use the check strategy, which compares the current and snapshotted values of a list of columns. If any of these column values have changed, then dbt will invalidate the old record and insert the new one. If the column values are identical, then dbt will not take any action.
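In the snapshot's config block, the check strategy swaps the updated_at column for a check_cols list. A sketch, reusing the orders example from earlier:

    {{
        config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='check',
          check_cols=['status'],  -- or check_cols='all' to compare every column
        )
    }}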

What should I snapshot?

Snapshots, like models, are defined as select queries. You can use source and ref in these queries like you would in any model query. With this flexibility you can snapshot effectively any part of your dbt DAG.

Snapshots should almost always be run against source tables. Your models should then select from these snapshots, using the ref function. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data. This is an important rule, but it's not one that dbt enforces.
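For example, a downstream model might select only the current version of each record by filtering on the dbt_valid_to column that dbt maintains (a sketch; the model and snapshot names are assumptions):

    -- models/stg_orders.sql (hypothetical model)
    select
        order_id,
        status,
        updated_at
    from {{ ref('orders_snapshot') }}
    where dbt_valid_to is null  -- only the current version of each order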

So, why run snapshots against source tables? Why not snapshot the results of models?

Snapshots, by their very nature, are not idempotent. The results of a snapshot operation will vary depending on whether you run dbt snapshot once per hour or once per day. Further, there's no way to go back in time and re-snapshot historical data. Once a source record has been mutated, the previous state of that record is effectively lost forever. By snapshotting your sources, you can maximize the amount of data that you track, and in turn, maximize your modeling optionality.

By contrast, a snapshot that runs on top of a model will record changes to "transformed" data. If the logic in a model upstream of a snapshot changes, then the snapshot operation will record different data. In some cases this can be desirable, but for typical modeling use cases, it presents an opportunity for data loss. Logic errors in data models are inevitable, but dbt works hard to make sure that you can fix these bugs and rerun your dbt project. This isn't the case if you have a snapshot in the middle of your DAG: that bug will likely result in data loss.

Getting started with snapshots

Snapshots are so powerful because they generate new information. The earlier you add snapshots to your dbt project, the more you can make use of their power. The docs are the best place to go for all the info about adding snapshots to your dbt project -- make sure you check out the section on "Snapshot queries" to understand the best way to set up your snapshot.

With great power comes great responsibility. The usefulness of snapshots relies on them being run regularly and reliably -- once you've added snapshots to your project, make sure you have a way to run the dbt snapshot command on a schedule (it's a good idea to schedule this job separately from your dbt run). Further, ensure that you have monitoring and alerting set up so you find out quickly if something goes wrong. We use dbt Cloud to schedule our snapshots and leverage Cloud's built-in email and Slack notifications for alerting.
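If you're not using dbt Cloud or another scheduler yet, even a simple cron entry can serve as a starting point (a sketch -- the project path and hourly cadence are assumptions):

    # Run snapshots at the top of every hour
    0 * * * * cd /path/to/your/dbt/project && dbt snapshot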

If you have mutable data sources in your dbt project, you should make the time to set up snapshots -- it only takes an afternoon! Remember:

The best time to start snapshotting your data was twenty years ago. The second best time is today.

Already added snapshots to your dbt project? Check out our tips on building models on top of snapshots!

Last modified on: Oct 15, 2024
