How dbt Labs’ data team approaches reverse ETL
Apr 08, 2022
You've likely noticed a lot of movement and interesting chatter around operational analytics, reverse ETL, and data activation in the past year or so. Reverse ETL is the process of syncing data from your data warehouse to the tools where business users need it.
Instead of only exposing data in a business intelligence (BI) tool, reverse ETL pipelines allow data to be synced down to ad platforms (Facebook, LinkedIn), sales CRMs (Salesforce), customer service platforms (Zendesk, Kustomer), and other destinations. There are a variety of platforms, such as Hightouch, RudderStack, and Census, that help data teams harness the power of reverse ETL. These tools allow data teams to create automated syncs that send data from their data warehouse to multiple end destinations. Census and Hightouch have also built integrations with dbt that make it easier to build a complete pipeline, from transformation through reverse ETL.
With reverse ETL pipelines, business users can now have accurate, data-team approved, and automated data in the tools they use every day.
What role does reverse ETL play for the Data Team?
Reverse ETL plays two primary roles for the Data Team at dbt Labs:
1. To provide supplemental and accurate data in business tools.
We want users to have the most accurate and useful data in their systems in an automated way. Business users shouldn't have to worry about manually entering fields that can easily change; instead, that data should be populated by our version-controlled, QA'd, and peer-reviewed dbt models. To do this, we leverage reverse ETL pipelines to create automated workflows that send data down to a variety of end business tools that we specify as exposures in dbt.
2. A method to put data in the tools where people work day-to-day.
Business users are comfortable and confident with the systems they use in their daily work. Data activation efforts allow us to move data to the places where users are at their best: able to navigate systems easily and perform actions quickly. Reverse ETL allows us to put the data business folks need directly into the places they spend the most time.
Some areas we’re currently using reverse ETL for include:
- Sending dbt Cloud project information, such as which version of dbt an account is using, to create appropriate in-app notifications for folks to upgrade their projects to the latest version.
- Syncing a record of our customers to our sales CRM and email platform. This ensures we have the most up-to-date and data-team-approved values there. We can additionally segment and create separate flows for users in our email platform.
- Sending some of our website’s pageviews and conversion events data down to an analytics tool to perform more robust analysis, such as funnel and retention analysis.
Our approach
We tackle reverse ETL projects with the following step-based approach:
- Understand stakeholder expectations: What’s the problem we’re solving for here?
- Identify the data models and fields needed to be pushed to the tools
- Transformation time! Create export models to power reverse ETL pipelines
- QA it up, create syncs in our reverse ETL tool, and involve stakeholders for training
- Set up dbt exposures for governance and documentation
- Establish alerts for reverse ETL pipelines to provide transparency to end business users
Below, we’ll walk through these steps in greater detail using a user segmentation example with the jaffle shop, a sample dataset and dbt project.
Before we jump into that, a few prerequisites you may need to create and manage reverse ETL pipelines are:
- Data stored in a data warehouse
- A reverse ETL tool to perform syncs from your data warehouse to your end platforms
- An end platform that you’re looking to push data to, such as a sales CRM or ad platform
Step 1: Understand stakeholder expectations
It goes without saying: most end business users are going to initially ask for the data in their business tool as real-time as you can get it. This is where we use our collaboration skills and data knowledge to help identify the appropriate cadence for the project.
We first try to understand the cadence business users ideally want, which boils down to three things:
- How often do we need to ingest data for the data models required to build this export model?
- How often do we need to run our dbt Cloud jobs for upstream models required to build export models?
- How often do we need to run syncs in our reverse ETL platform to the end tool?
To understand the appropriate cadence for both business stakeholders and data capabilities, it’s important to identify the true goals of your end business users. Why do they think they need the data at this cadence? How would the decision they make differ if data was refreshed every 30 minutes versus every three hours?
As analytics engineers, we also have some gut reactions that help us drive that conversation with stakeholders. We know which fields and dimensions change relatively quickly and conversely, which fields don’t change that often. Using that mixture of stakeholder wants and our instinctual knowledge helps us establish the realistic SLA for data in the end platform.
Step 2: Map out entities & fields
Now we're getting into the fun stuff! At this stage, we need to understand the data and dbt models needed to build export models. We do two forms of planning: entity matching and field identification.
Entity matching
It’s important to understand how the grains differ between our existing dbt models and the end business tools’ system. For example, how does a Salesforce contact match or map to a dbt Cloud account? This is the type of question we solve in this stage of planning.
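As a loose illustration of what this grain mapping can look like in SQL, here's a minimal sketch that rolls a hypothetical contact-grain model up to an account grain so it can be matched one-to-one against a CRM account record. The model and field names are made up for the example.

with contacts as (

    -- hypothetical contact-grain parent model
    select * from {{ ref('dim_contacts') }}

),

accounts as (

    -- many contacts roll up to one account, matching the CRM's grain
    select
        account_id,
        count(*) as contact_count,
        min(created_at) as first_contact_created_at
    from contacts
    group by 1

)

select * from accounts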
Field identification
At this stage, we need to determine if we have all the fields already modeled out or if we need to create new fields for the project. Most of the data business users want exported down is derived from our fct_ and dim_ models. Oftentimes, the fields stakeholders want in their downstream tool already exist in our data models, but may need to be cast, renamed, or grouped differently.
Step 3: Create export models
Just as we have marts, staging, and metrics directories for our dbt project, we create a separate models/exports directory for all of the data models we want to sync to business tools. For every existing data model we want to sync out to different destinations, we have one export model.

For example, if we have a customers model like in the jaffle shop that we want synced down to multiple destinations, we would create a single export_customers.sql model.
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   ├── exports
│   │   └── export_customers.sql
│   ├── marts
│   ├── staging
│   └── metrics
The primary goal of these export models is to supplement any existing data or modeling in our fct_ and dim_ models required by stakeholders, in the structure that the tool we're syncing to expects. I just said a lot of words, so let's break down what I really mean by this.

Our fct_ and dim_ tables stored in our models/marts directory form the backbone of analytics at dbt Labs. An immense amount of work goes into making our fct_ and dim_ tables what they are:
- Closer to the end of the DAG
- Have undergone considerable transformation
- Likely experienced some entity joins to get them in their final shape
- True and proven: these models have undergone immense testing and peer review to ensure the data they hold are accurate and usable
Our ultimate belief: Export models shouldn’t be doing the heavy lifting that happens for fact and dimensional models.
I also said that another goal of export models is to structure the data in a way the downstream business tool expects. What we might store as a float in our data warehouse may need to be passed as an integer in an end business tool. To understand if any casting or object creation needs to take place in an export model, we read the documentation provided in our reverse ETL tool or the actual destination’s API documentation.
As a result, these are the types of transformations that typically happen in our export models (a brief sketch of these patterns follows the list):
- Casting to specific types
- Simple aggregations and summations
- Aliasing fields
- case when statements to make fields more readable for end business users
- Constructing arrays or objects
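Here's a minimal, hypothetical sketch of those patterns, assuming a Snowflake warehouse; the parent model, the object_construct payload, and the column names are illustrative rather than pulled from our actual project.

with orders as (

    -- hypothetical parent model
    select * from {{ ref('fct_orders') }}

),

final as (

    select
        order_id,

        -- casting to the type the destination expects
        cast(amount as integer) as amount_in_dollars,

        -- aliasing to a clearer field name
        status as order_status,

        -- case when to make the field readable for business users
        case
            when status = 'completed' then 'Completed'
            else 'In progress'
        end as order_status_label,

        -- constructing an object for destinations that accept structured fields
        object_construct(
            'order_id', order_id,
            'status', status
        ) as order_payload

    from orders

)

select * from final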
A good rule of thumb for figuring out what you need to model in a parent model or in an export model is to look at use cases: are you going to need these additional transformations outside of the end destination? If so, it should probably be modeled in an upstream model.
It's also good to call out that there is a lot of select'ing that happens throughout our export models. In fact, several of our export models are simple select * from upstream_table statements. We prefer to send down all possible fields to give stakeholders maximum flexibility down the road; by giving stakeholders all the options they need, they're not reliant on the data team to add new fields or create new reverse ETL pipelines.
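In its simplest form, that kind of pass-through export model is a single statement (the model names here are placeholders):

-- export_orders.sql: a pure pass-through, so every field reaches the destination
select * from {{ ref('orders') }}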
It's important to note that we materialize all export models as views (a minimal config sketch follows the list below). There are a few reasons we do this:
- All parent models for export models are either fact or dim models that are always materialized as tables or incremental models. This allows the view materialization for export models to be more performant.
- Materializing export models as views helps ensure that the data piped down to tools is as up-to-date as possible. We don't have to worry about creating separate dbt Cloud jobs to rebuild these models as tables.
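One way to enforce that materialization policy is a directory-level config in dbt_project.yml; the project name below is a placeholder.

models:
  jaffle_shop:
    exports:
      +materialized: view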
A customer segmentation example
As we mentioned earlier, a common use case for reverse ETL is to send customer information down to email platforms to support better customer segmentation. For this scenario, we can use the jaffle shop's customers table to create a new export model. While we're using the jaffle shop here as an example, it represents a similar flow to what we do internally for syncing to our email platform.

The customers table here has seven columns: customer id, first name, last name, first order date, most recent order date, total number of orders, and customer lifetime value (CLV).

All of these fields could potentially be used by business users in an email platform: maybe they want to pass a customer's number of orders into the email subject line to make the emails more personalized, or create different customer segments based on CLV for a loyalty program. As a result, we're going to make sure we pass down all possible fields. In addition, the marketing team requested a few features for this export model:
- The marketing team wants to create a new segment, buyer_type, based on the number of orders a customer has placed
- The team is interested in having the most recent order date and first order date extracted out to the month level so that they can contact new, bounced, and longtime loyal customers with different messaging

Use the following code to create an export_customers.sql model that lets your friends on the marketing team go wild and free with email segmentation.
with import_customers as (

    select * from {{ ref('customers') }}

),

segment_customers as (

    select
        customer_id,
        first_name,
        last_name,
        date_trunc('month', first_order_date) as first_order_month,
        date_trunc('month', most_recent_order_date) as most_recent_order_month,
        number_of_orders,
        customer_lifetime_value,
        case
            when number_of_orders >= 3 then 'super_purchaser'
            when number_of_orders >= 1 and number_of_orders < 3 then 'regular_purchaser'
            else 'non_purchaser'
        end as buyer_type
    from import_customers

)

select * from segment_customers
As you can see from this example, there’s not really any fancy or mind-boggling data modeling happening here.
All we did was extract some dates, alias a few fields, and add a simple case when statement. Again, upstream models should be doing the heavy lifting; your export models are really just adding the finishing touches to any parent models.
Step 4: QA, set up reverse ETL sync, & train stakeholders
If we’re performing additional transformations in our export models, it’s important we treat that model like any other dbt model: one that undergoes testing and documentation processes. This means checking row counts are the same between export models and parent models and ensuring fields are in their correct format. In the future, we’d like to have more testing on our export models, such as the accepted_values test.
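For example, an accepted_values test on the buyer_type segment might look something like the sketch below; this is a hypothetical properties file, not something in our project today.

version: 2

models:
  - name: export_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: buyer_type
        tests:
          - accepted_values:
              values: ['super_purchaser', 'regular_purchaser', 'non_purchaser']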
Once we’re confident in the dbt export models, it’s time to sync the data via our reverse ETL tool into a sandbox environment in the end business tool, where possible. This helps us and stakeholders ensure that data syncs as expected and formats look good without having to impact a production environment. In this stage, stakeholders can ensure that everything looks and feels the way it should: are fields labeled clearly? Are dbt descriptions clear? Is there any casting that happened that’s making the final data look wonky?
Once we get the okay from stakeholders and set up the syncs in our reverse ETL tool, it’s time to have some official, on-the-books training with them. It’s important they understand how the data is being generated, how the syncs work, and how to use the export for self-service motions. We usually set up one-hour discussions with them for this hands-on training session.
Step 5: Create exposures
Just as we would for key dashboards in our BI tool, we create dbt exposures for our reverse ETL syncs. Exposures are a great way to assign ownership, identify dependencies, and provide context for reverse ETL projects. They also allow us to visualize all of our downstream dependencies in the DAG.
We create a separate exposure_export.yml file in the models/exports directory where we store exposure information specifically for all of our export models. Below is an example of how we would document the exposure for the export_customers model we created for the jaffle shop in Step 3.
version: 2

exposures:
  - name: email_crm_export
    type: application
    maturity: medium
    url: <reverse ETL tool sync URL>
    description: '{{ doc("email_crm_export") }}'
    depends_on:
      - ref('export_customers')
    owner:
      name: brandon t.
      email: b******@dbtlabs.com
Step 6: Set up alerts in reverse ETL tool
We see reverse ETL pipelines fail for a few reasons:
- Major changes to export upstream models
- Unexpected destination API outages or changes
- Reverse ETL tool outages
When reverse ETL sync failures do happen (and they do happen, because neither we nor the world is perfect), it's incredibly important to communicate to our stakeholders that data may not be up-to-date. To do this, we set up alerts in our reverse ETL tool for sync failures that notify a Slack channel. We additionally create tool-specific channels (e.g., #tool-sales-crm) and notify the group members about the failure (if it's something we can't resolve on our own).
Where reverse ETL is heading for the data team at dbt Labs
We’ve just opened the first chapter of reverse ETL opportunities for data at dbt Labs. There’s so much room to grow and look forward to, particularly around advanced orchestration, improved observability, and increased self-service initiatives.
Advanced orchestration
Right now, many of our reverse ETL pipelines are synced around the timing of their upstream parent models, which in turn depend on data being loaded by data loaders that run at different cadences. In a future world, we'll be able to orchestrate everything, from data extraction through transformation, around the reverse ETL sync. More customized orchestration will let us move toward more real-time reverse ETL pipelines where that kind of freshness is appropriate.
Improved observability
It'll be a game-changer for our reverse ETL initiatives if we can get increased observability into our data ecosystem. If we're quickly notified when fields change or source tables undergo major changes that impact downstream exposures, such as reverse ETL syncs, we can be more proactive about updating the affected dbt models.
Increased self-service motions
This is the biggest thing for us: we want all of our data initiatives to promote self-service. The more we can improve and add to our export modeling, the better we can enable self-service downstream analytics. With robust export models and syncs, anyone can create audiences and use fields in their tool, since we know the data is accurate and regularly refreshed.
Thanks for reading about how the data team at dbt Labs tackles reverse ETL. We're excited to see where this foundation takes us, and where it holds us back. Like all data work, we're always looking for ways to improve, to make things easier for us and better for business users.