Today, we're excited to announce that dbt Core v1.9 (named after Dr. Susan La Flesche Picotte) is GA. Since we started this journey in 2016, dbt has quickly become the standard in data transformation, with over 50,000 teams worldwide relying on dbt to build, test, and document their data products. dbt Core v1.9 includes improvements large and small designed to help data teams work more like software engineers—that is, in a way that's modular, scalable, repeatable, and governed. I'll dive into more details in this post, but the highlights of v1.9 include:
The big hits:
- 🤏 New microbatch incremental strategy to optimize your largest datasets—transform your event data in discrete periods with their own SQL queries, rather than all at once
- 📸 New configurations and spec for snapshots to make them easier to configure, run, and customize
The smaller stuff:
- 👯‍♀️ Improvements to state:modified behaviors to help reduce the risk of false positives
- 📄 Document your data tests by adding `description`s
- and more!
And adapter-specific features:
- 🧊 Standardizing support for Iceberg, with standard configs on more adapters to materialize dbt models in Iceberg table format (read more here)
Check out our upgrade guide for more information and read on for a deeper dive into the community efforts and conversations that helped bring these features to life.
Marriage: It’s what brings us together
If you attended my 2024 Coalesce talk, we’re married now ;)
To symbolize our continued commitment to open source and discuss the ways our relationship has grown over the years, I hosted a very special ceremony in which I renewed my vows with the dbt Community. And, it’s Vegas, so of course Elvis was there to officiate.
dbt Core was created eight years ago—a tool and framework to standardize the way we all do data transformation.
Because our co-founders decided to make dbt Core open source—following the pattern set by the best CLI tools and programming languages—the dbt community was able to grow organically and collaboratively.
dbt enables data practitioners to work like software engineers—to automatically handle dependencies, test and document our data models, and version control our code. But the real power of dbt is providing a standard framework for doing this work. dbt gives us a common language that empowers us all to share our solutions and build together.
dbt isn't just a tool, it’s a community. We are a group of people who have chosen to believe in a viewpoint—and then watched as that viewpoint transformed data work across the industry.
Like any relationship that’s lasted this long, we’ve grown and changed together.
dbt, the company, has grown from an analytics consulting practice with an open source tool, to an open core business with a real path to long-term sustainability (one that can fund the ongoing development of that open source tool).
dbt, the product, has gotten a lot more mature and stable.
dbt, the community, has expanded to over 100,000 members, all across the world.
Eight years and 50,000 weekly active dbt projects later, dbt Core is the industry standard for data transformation, and we're committed to living up to the trust organizations worldwide have placed in us.
What does that mean?
It means:
- dbt Core will remain licensed under Apache 2.0.
- The dbt framework will continue to be shaped by a collaborative effort between you (the community) and us (the maintainers).
- When we add something new to the standard, we are committing to the long term. We must be intentional about how and when we do it. You can be confident that, once added, it's there to stay.
Extensibility is what powers the community
So: We take our responsibility of owning the standard very seriously, and we aim to be really intentional when that standard is updated or expanded.
One of the best things about dbt is that it is flexible and extensible. If you have a problem to solve, you can use the tools dbt gives you—yes, including jinja—to implement a solution.
Have you ever written a custom materialization, overridden one of the built-in core macros, or nested some DML in a post-hook to solve a niche problem for your data team?
I certainly have. The ability to do this type of “customization” is by design.
A mantra for our team, borrowed from the programming language Perl, is “make the easy things easy, and the hard things possible”.
We want you to find creative solutions, unblock yourself, iterate rapidly, and share your code snippets to unblock others. We don’t want to be a bottleneck for you getting your day-to-day problems solved.
Providing an out-of-the-box solution for the entire gamut of problems you might encounter when doing data work wouldn’t be feasible or recommended. dbt the framework should be opinionated and have clear scope. And the extensibility of dbt can and should be leaned on to solve a whole host of niche problems.
We also recognize that extensions have their limitations, and sometimes, the right strategy is to go from making something "possible" via custom extensions... to making it "easy" with an out-of-the-box solution built into the dbt Core standard.
So how do we know when it’s time to make that transition?
We rely on a few signals:
- When we see a ton of people upvoting an issue in our GitHub repos
- When we see multiple open source dbt packages trying to close the gap
- When we see people leaving dbt’s framework to solve this problem, rather than hacking within it
- When we see technical advancements in the ecosystem that make a hard thing easier
These things tell us a capability is ready to graduate to become an official part of the dbt standard.
A lot of the features available to you now in dbt Core v1.9 are things that community members have been discussing and experimenting with for years.
Thank you for your efforts, for your thoughtful conversations, for helping us shape the future of dbt.
Let’s get into it.
What’s new in dbt Core v1.9
v1.9 delivers two new major features to the dbt framework, as well as a long list of smaller improvements. For each of the “big ones,” let’s take a look at not just what the feature is, but how community efforts shaped how the feature was prioritized and built.
Microbatch incremental models
The `incremental` model materialization was first introduced in 2016 in v0.4.0. It is a foundational part of how we think about optimizing your datasets that are too large to be dropped and recreated from scratch every time you do a `dbt run`.
The `incremental` materialization is an advanced strategy best used on models that are large enough that you don’t want to do a full refresh on every run. Instead of reprocessing the entire dataset every time a model is run, incremental models process a smaller number of rows with new data, and then append, update, or replace those rows in the existing table.
If anything goes wrong or your schema changes, you can run in “full-refresh” mode, which rebuilds the whole table from scratch using that same query.
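To make that concrete, here's a minimal sketch of a classic incremental model; the model, source, and column names are illustrative, not from a real project:

```sql
-- models/page_views.sql (illustrative names)
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

select *
from {{ source('web', 'raw_page_views') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what's already in the target table
  where viewed_at > (select max(viewed_at) from {{ this }})
{% endif %}
```

On the first run (or when you pass `--full-refresh`), the `is_incremental()` block is skipped and that same query rebuilds the table from scratch.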
In 2018, an experimental materialization called `insert_by_period` was created as a further performance optimization for datasets that are so massive that rebuilding the whole table in a single query is just not possible without running into warehouse timeout issues.
Instead, `insert_by_period` processed event data in discrete periods with their own SQL queries, rather than all at once.
This is why extensibility is so powerful; we can experiment, we can let a solution bake, and we can then ask the question, “Should this become a real part of dbt?” (as Joel did in 2021).
While the original experimental materialization was only supported in Redshift, it was expanded this year to work for other data platforms, including BigQuery and Databricks.
This experiment allowed us to iterate on batch-based processing and, in the meantime, unblock dbt users who needed this kind of approach. However, it lacked the official seal of support from dbt project maintainers. And limitations existed—it required copy/pasting macros into your own project, batches had to be run in serial, and there was no individual logging for each batch.
It was time for this feature to be built directly into dbt Core.
We are excited to announce that starting in dbt Core v1.9, you can use the brand-new `microbatch` `incremental_strategy` to break up your massive datasets into smaller, bite-sized batches that dbt can process individually, making your workflows faster, more reliable, and easier to manage.
Simply set your `incremental_strategy` to `microbatch` and write your SQL for a single “batch” of data.
dbt will then evaluate which batches need to be loaded, break them up into one SQL query per batch, and load each one independently.
Batches can be run concurrently, and batch size can be `hour`, `day`, `month`, or `year`. Check out our docs for more information.
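To give a flavor of what that looks like, here's a minimal sketch of a microbatch model; the model, column, and config values are illustrative, so treat the docs as the authoritative spec:

```sql
-- models/sessions.sql (illustrative names)
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',  -- column dbt uses to slice the data into batches
    batch_size='day',                -- hour | day | month | year
    begin='2024-01-01',              -- earliest point in time to backfill from
    lookback=3                       -- reprocess the N most recent batches on each run
) }}

-- write the SQL for a single batch; dbt adds the event_time filter for each batch
select * from {{ ref('events') }}
```

Upstream models you want filtered batch-by-batch need their own `event_time` configured as well; the docs walk through the details.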
Improvements to snapshots
Snapshots were first released as `dbt archive` in v0.5.1 back in 2016, just two days shy of dbt's 6-month anniversary.
Snapshots provide a solution for capturing changes to your data as type-2 Slowly Changing Dimensions, so you can “look back in time” at previous states of your tables to incorporate historical data into your analyses and measure trends over time.
While they were originally declared within the `dbt_project.yml`, in 2019 snapshots moved to a dedicated snapshots folder and were defined within a special “jinja block”. (Oh, `{% snapshot my_snapshot %}`!)
We built snapshots a long time ago, and the issues that the community has opened over the past ~7 years signaled to us that the original snapshots hadn’t really kept up.
Part of our responsibility of maintaining the standard includes admitting that we don’t always get it right.
In early 2023, our DX advocate for dbt Core (Doug Beatty, aka “Timestamp Doug” for those in the know) opened the discussion “Problems & (Potential) Solutions for Snapshots”, pulling from a long list of issues and comments describing the ways that our current implementation of snapshots just wasn’t cutting it.
And so, I’m happy to announce that dbt Core v1.9 incorporates a number of improvements that make snapshots easier to configure, run, and customize, including:
- New simplified snapshot specification: snapshots can now be configured in a YAML file, which provides a cleaner and more consistent setup
- New `snapshot_meta_column_names` and `dbt_valid_to_current` configs: allow you to customize the meta fields that dbt automatically adds to snapshots
- `target_schema` is now optional for snapshots: when omitted, snapshots will use the schema defined for the current environment, meaning you can maintain environment-aware snapshots
- New `hard_deletes` config: get more control over how to handle deleted rows from the source
- And more
Check out our docs for more information.
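To give a flavor of the new spec, here's a minimal sketch of a YAML-defined snapshot; the source, columns, and config values are illustrative rather than a complete reference:

```yaml
# snapshots/snapshots.yml (illustrative names)
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      unique_key: order_id
      strategy: timestamp
      updated_at: updated_at
      dbt_valid_to_current: "to_date('9999-12-31')"  # sentinel date instead of NULL for current records
      hard_deletes: invalidate                       # control what happens when source rows are deleted
      snapshot_meta_column_names:                    # rename the meta columns dbt adds for you
        dbt_valid_from: valid_from
        dbt_valid_to: valid_to
```

Note that there's no `target_schema` here: when omitted, the snapshot simply lands in the schema for the current environment.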
The smaller stuff
It’s not just about the big stuff. The little things matter too, and we tackled a lot of them in v1.9. To name a few:
- set your foreign key constraints using `ref`
- document your data tests by setting a `description`
- fewer false positives for `--select state:modified`
Check out our docs for more information.
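For a flavor of the first two, here's a minimal YAML sketch; the model, columns, and test are illustrative, so check the docs for the exact syntax:

```yaml
# models/schema.yml (illustrative names)
models:
  - name: orders
    config:
      contract:
        enforced: true  # constraints are applied as part of an enforced model contract
    columns:
      - name: order_id
        data_type: int
      - name: customer_id
        data_type: int
        constraints:
          - type: foreign_key
            to: ref('customers')  # reference the parent model with ref() instead of hard-coding it
            to_columns: [id]
        data_tests:
          - not_null:
              description: "Every order should reference a customer."
```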
I'm just a ~~girl~~ PM, standing in front of a ~~boy~~ community, asking ~~him~~ them to love her <3
On a personal note, I feel so lucky to be building this product with all of you. I get to log on to my computer every day and interact with thousands of “internet friends” who want to help make dbt the best it can be. This community is so special, and it’s truly an honor to be a part of it. To everyone who participated in what we built this year, and all of you who are helping shape what we build next, thank you.
Speaking of…
I want to hear from you!
What do you struggle most with when using dbt today? What custom solution have you had to implement multiple times that you wish dbt offered out-of-the-box? Big or small, I want to know what’s on your wishlist of dbt features.
There are so many ways to participate in this community:
- Upvote and comment on GitHub issues, or start a GitHub discussion or Discourse post when something’s not-so-clear-cut
- Join us on Zoom for feedback sessions to help us design features that feel like dbt
- When you find a way to solve that unique problem, share it—in a blog post, at a dbt meetup, or by speaking at Coalesce
- If you really want to get into the weeds, contribute code back to one of our open source repos, for one of our issues tagged `help_wanted` and `good_first_issue`, and our engineering team will work with you to get it over the finish line
You don’t have to write code to contribute to the dbt open source community. Sharing your different approaches and what you’ve learned in practice is how we all move up the stack.
I said my vows to you on the Coalesce stage, and I’ll say one again here:
We vow that dbt is not dbt without you, the community. Whether you use dbt Core or dbt Cloud or some mixture of the two… your thoughts and opinions matter to us. Because this is a relationship, and we want to build dbt with you for a long time to come.