
Accelerating dbt development with SDF

At dbt, we're proud that our technology has helped so many companies bring higher quality and repeatability to their data workflows. However, nothing's perfect. When it comes to dbt - whether developers are using dbt Core or dbt Cloud - we know there's a lot we can do to speed up the analytics code development process.

That's why dbt acquired SDF - and why both teams are now working closely to make SDF the future of the dbt engine.

dbt was built by a bunch of data practitioners and hackers whose work found purchase because they understood the analytics workflow well. SDF has a different DNA, springing from managing Personally Identifiable Information (PII) across thousands of data workflows at Meta and other massive companies.

While the DNA of the two companies may be different, we both believe they make a perfect match when fused together. SDF brings faster performance, optimized data platform costs, and better data lineage to dbt - all without changing the code in your project. Over time, it’ll bring even more. In fact, we regard SDF as the future state of the dbt engine.

We’ll dig into what problems SDF solves for dbt, how it makes the dbt developer experience better, and when you can expect to start wielding it.

How the SDF engine improves the dbt developer experience

dbt is a fantastic configuration tool and scheduler: once you specify your data lineage, dbt builds it into a Directed Acyclic Graph (DAG) and visualizes it for you. However, you have to specify that DAG manually with ref statements. As an engine, dbt understands the SQL you write more as a series of strings that it expands using Jinja templating.
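
For readers less familiar with dbt, here's a minimal, hypothetical model (the model and column names are made up) showing what that manual DAG declaration looks like in practice:

-- models/orders.sql (hypothetical dbt model)
-- The {{ ref('stg_orders') }} call is the manual DAG declaration: it tells dbt
-- that orders depends on stg_orders. To the engine, the rest of the query is a
-- string that Jinja expands before it is sent to the warehouse.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}
where status != 'cancelled'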

The problem is that until your queries hit the data warehouse, dbt has no way to tell whether they're valid. You always have the overhead of a round trip to the warehouse before you get any feedback on whether something is correct. That puts the framework behind programming languages like C, Python, and Rust, which give you that feedback locally.

Part of the challenge is that SQL is a highly vendor-driven language. Unlike Python, which has a single reference runtime, there are as many versions of SQL as there are cloud data warehouses. Vendors have had 30 years to slowly differentiate, creating structures, types, and functions that are only compatible with their own SQL dialect.

SDF wanted to change this lack of a good local development experience for SQL transformations. So, we partially built out and defined the grammars for the different SQL dialects used by different data warehouses: a grammar for Snowflake that's equivalent to the grammar Snowflake itself uses, one for Redshift, and so on.

With that grammar defined locally for, say, Snowflake, we can build something that works like the Snowflake compiler but runs locally.
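
To make the dialect problem concrete, here's one small illustrative example (the orders table is made up): the same seven-day date shift written for two different warehouses.

-- Snowflake dialect: DATEADD takes the unit as its first argument
select dateadd(day, 7, order_date) as ship_by from orders;

-- BigQuery dialect: the equivalent uses DATE_ADD with an INTERVAL expression
select date_add(order_date, interval 7 day) as ship_by from orders;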

In other words, with SDF, we don’t just resolve your dbt Jinja templating. We actually compile and validate the SQL query before it ever gets to the data warehouse. That means that if there's an error, you get that error much, much, much quicker in the development cycle.

How SDF works

So, how does this work? Lukas does a full walkthrough in our related webinar, which you should check out. Here’s a quick summary.

Note: We're discussing how SDF works today. In the future, you won't have to run separate SDF commands or change dbt commands to take advantage of the benefits of SDF. dbt is bringing SDF into the product, giving all dbt developers a set of powerful new checks and tools as part of their existing dbt workflows.

Currently, to use SDF, we create an SDF workspace that includes some SQL models, functions, and tests. We can then run sdf compile - executed by a small, fast binary compiled from Rust - to test recent code changes we've made.

[Screenshot: SDF compile]
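
As a concrete illustration, imagine the workspace contains a model like the hypothetical one below (table and column names are made up). Running sdf compile parses the query and checks each referenced table and column against the schemas pulled from the warehouse, as described below.

-- models/customer_orders.sql (hypothetical SDF model)
-- A plain SQL model in the workspace; sdf compile parses this query and
-- validates every table and column reference against the warehouse schemas.
select
    c.customer_id,
    c.first_name,
    o.order_date,
    o.amount
from raw.customers as c
join raw.orders as o
  on o.customer_id = c.customer_id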

The first thing that SDF does is download the relevant schemas from your data warehouse. Currently, there's nothing else that stitches together the current state of the data warehouse with the current state of your code. SDF bridges that gap.

With those schemas downloaded, we can now validate your code against the current state of your data warehouse. This makes development faster because you catch bugs and issues with state as you're developing rather than when you're running code in your CI/CD system. As a result, your deployment pipelines break far less often.

Detecting errors in under half a second

For example, let’s break something by referencing a column that doesn’t exist. If we compile again, SDF reports - in less than half a second - that we’ve broken our query by referencing a non-existent column.

[Screenshot: compile error for the broken query]
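
Continuing the hypothetical model from above, the mistake might look like this; the typo can only be caught locally because the compiler knows which columns actually exist in raw.customers.

-- models/customer_orders.sql (broken, hypothetical)
select
    c.customer_id,
    c.first_nmae,   -- typo: this column does not exist in raw.customers
    o.order_date,
    o.amount
from raw.customers as c
join raw.orders as o
  on o.customer_id = c.customer_id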

We can fix it and recompile and - in less than half a second - we get the all clear.

In short, once they're synced to the current state of the data warehouse, developers can code and test rapidly on their local machines with virtually zero latency and without getting stuck in eternal CI/CD loops. And this works for data warehouses of any size and scale.

Developers get automatic state management as part of this system as well. That’s because SDF generates a lot of metadata - column lineage, code state, the data warehouse state, etc. - and then uses that to optimally calculate and recalculate what needs to be run at any given point in time.

For example, let’s say we have 12 tables, including an orders and a daily_orders table, with the latter depending on the former. We change orders and re-run our models.

[Screenshot: re-running after a change to orders]

In our previous run, we ran 12 models, and all 12 succeeded. In the second run, the engine resolved dependencies, discovered what metadata and data changed, and then only ran two models - orders and daily_orders. Spread across an entire company, this saves a crazy amount of both money and time.
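
To make the dependency concrete, here's a hypothetical sketch of what daily_orders might look like (column names are made up). Because it reads only from orders, an engine that tracks this lineage and the associated metadata knows these two models are the only ones that need to be re-run:

-- models/daily_orders.sql (hypothetical)
-- Aggregates the orders model, so it must be recomputed whenever orders changes.
select
    order_date,
    count(*) as order_count,
    sum(amount) as revenue
from orders
group by order_date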

Removing PII from your downstream consumers

Let's look at one other powerful piece of SDF: data classification. SDF supports its own native data classification system, so you can use metadata to track personally identifiable information (PII) - like Social Security numbers, credit card numbers, and so on.

While handling PII is critical in today's modern data architectures, it's difficult to build an automated system that does this across your whole data warehouse. Instead, you have to tag every column manually and keep those tags up to date as your warehouse changes.

It's a nightmare. It's not fun. Most of us at dbt and SDF - and many of you, we’re sure - have spent some time manually tagging things. It's terrible.

In SDF, we can tag PII easily. We can also specify how PII is converted - for example, in the code below, we tell the system that the MD5 function converts any PII to hashed values.

[Screenshot: tagging PII]

Why is this important? Data changes as it flows through the warehouse, so you can't just propagate PII downstream along column-level lineage. You have to understand how functions change that PII. Once you define this, you can take advantage of an integrated engine, grounded in information flow theory, that propagates these classifications downstream for you.

As you can see in the example below, we've added PII classifiers to certain fields - such as first name, last name, and phone number - in our RAW_Customers table.

[Screenshot: PII classifiers]

But if you scroll down a little bit, you'll see that the CRM_customers table also carries this PII. And further down, the big customers table has some PII hashed and some still in its native form.

What the engine is doing is working with higher-level types - in addition to the varchars and ints that the SQL engine works with - and propagating those types through the warehouse. You only have to classify things at ingestion, and then you automatically get full coverage across the data warehouse.
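
As a rough sketch of the idea in SQL (hypothetical table and column names, not SDF's configuration syntax):

-- models/crm_customers.sql (hypothetical downstream model)
select
    customer_id,
    first_name,                        -- first_name was classified as PII at
                                       -- ingestion, so the classifier follows
                                       -- it here through column-level lineage
    md5(phone_number) as phone_hash    -- because MD5 is declared to convert PII
                                       -- to hashed values, this column propagates
                                       -- as hashed PII rather than raw PII
from raw_customers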

Local linting

Finally, SDF also supports a built-in lint command that works differently than SQLFluff. It operates on the syntax tree generated by the parser, which makes it quite fast.

Below, we run the sdf lint command. You can see that it finishes in about 0.4 seconds.

[Screenshot: sdf lint output]

Here, it shows us where we have a number of consistency issues - for example, keywords that are expected to be consistent, and problems with casing. We can quickly fix these issues and, in another 0.4 seconds, validate that everything is consistent.
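
For illustration, here's the kind of hypothetical inconsistency a linter like this flags, followed by the consistent version you would fix it to (the orders table is made up):

-- Before: keyword casing is inconsistent within the same query
SELECT customer_id, sum(amount) as revenue
from orders
GROUP BY customer_id;

-- After: keywords use consistent casing throughout
select customer_id, sum(amount) as revenue
from orders
group by customer_id;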

Bringing SDF into dbt

Both dbt and SDF are working hard together to turn our two engines into one. Our goal is to bring as much SDF functionality as we can into open-source dbt Core, with a subset integrated into dbt Cloud. You can learn more at our upcoming launch event on March 19th.

To dive more deeply into what’s coming down the road with SDF integration into dbt and see how SDF works in practice, watch the full webinar.
