
The power of a plan: How logical plans will impact modern data workflows

This post first appeared in The Analytics Engineering Roundup.

If you work with SQL, you are used to working with a compiler – you just might not know it yet.

You've probably seen a compiler error message from your warehouse like this:

$ dbt run --select my_model

[...]

22:29:56   Database Error in model my_model (models/marts/my_model.sql)
  001044 (42P13): SQL compilation error: error line 4 at position 11
  Invalid argument types for function 'DATE_ADDDAYSTOTIMESTAMP': (TIMESTAMP_LTZ(9), NUMBER(1,0))

This error occurred because the database's SQL compiler received invalid input. Compilers are programs that translate a high-level language into a logical plan (read more in our post about the key technologies behind SQL Comprehension). Historically, this capability has been constrained to the data warehouse.
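Concretely, a logical plan is a tree of typed operations derived from the query. Here's a simplified, illustrative sketch (not any specific warehouse's EXPLAIN output; the table and columns are made up):

select round(amount) as rounded_amount
from orders
where region = 'EMEA'

-- illustrative logical plan:
-- Project [round(amount) -> rounded_amount :: number]
--   Filter [region = 'EMEA']
--     TableScan [orders: amount :: number, region :: varchar, ...]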

But the error above wasn't confined to the warehouse: it was surfaced in dbt, a data transformation tool. This points to an interesting split in capabilities:

  • Your data warehouse produces logical plans internally for one query at a time
  • Your transformation tool knows the dependency graph of every query at once (your DAG!)

Imagine if you could bring the compiler into your development workflow. You'd know the implications of your changes as you worked, instead of when you hit run.

We believe combining the powers of SQL Comprehension and DAG Comprehension will make data practitioners more effective. It will move us up the stack.

To do this right, you need the right tech - an accurate, multi-dialect, performant SQL compiler.

Today we’ll break down three major benefits of a development workflow powered by a compiler.

  1. Validate: write provably correct SQL queries, saving time and money by catching errors in SQL before running them against a cloud warehouse
  2. Analyze: with access to comprehensive, type-aware metadata you can understand how the data flows from ingestion to consumption, powering experiences like precise column-level lineage, description propagation, and more
  3. Optimize: improve performance on both the query and engine level

Validate: Will my SQL really work?

To validate your SQL code is to guarantee it will run and process data successfully (as described in this post on the levels of SQL comprehension that unlock different types of validation). The compiler can validate everything except the data itself. This means by the time the compiler has produced a logical plan, it has by definition also validated the SQL.

But most compilers validate one query at a time, in isolation.

The opportunity in a developer experience powered by a compiler is to validate all queries that make up a data pipeline.

What does this mean for you? Let’s start with a time saver - precise error messaging before executing your query in a cloud warehouse.

A compiler understands every mechanism that can change the data type of a column: built-in functions and UDFs, operators, and implicit conversions. When it hits an invalid combination, say a round() function with a boolean passed in, it effectively says, "hey, nobody told me what to do with this, I'm out of here!" and reports an error message.

This is a simple example - but most of the time you're not referencing a boolean directly, you're referencing a column. If you can analyze all queries at once and understand their dependencies, you can determine when modifying a column in one model will break something downstream. This is called impact analysis.

Let’s say we have two models:

-- model_1.sql
select some_numeric_data as my_column
from {{ source('my_source', 'my_table') }}

-- model_2.sql
select round(my_column) as rounded
from {{ ref('model_1') }}

The SQL in model_2 is valid - it takes my_column which contains numeric data and rounds it.

But what if you tried to change the datatype in model_1 to a boolean? Maybe it's a 1/0 indicator and you want it to be converted to true/false instead.

-- model_1.sql

- select some_numeric_data as my_column
+ select cast(some_numeric_data as bool) as my_column
  from {{ source('my_source', 'my_table') }}

-- model_2.sql

select
    round(my_column) as rounded
from {{ ref('model_1') }}

The updated SQL in model_1 is still valid. model_2, however, is now broken — something you’d only discover when you run the DAG.

By having full awareness of both your DAG and each query's logical plan, you can understand the impact of your change during development and prevent errors in all downstream dependencies.
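Armed with that warning, you can fix model_2 in the same change instead of discovering the break at runtime. One possible repair (a sketch - mapping the boolean back to a number so the rounding still type-checks) looks like this:

-- model_2.sql
select
    round(case when my_column then 1 else 0 end) as rounded
from {{ ref('model_1') }}

You might equally decide the right fix is to revert the cast in model_1; the point is that the compiler lets you make that call before anything runs.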

Once we fix our DAG to make sure the SQL is valid, what’s next? Time to analyze it.

Analyze: Generating precise column-level lineage and beyond

Now that our compiler has validated the SQL and produced a logical plan for each query in our DAG, an analyzer can step in to extract valuable metadata. During analysis, user-defined metadata (descriptions, tests, classifications) is combined with the Logical Plan to power data catalogs, automated data governance, smart caching, and more.

Precise Column-Level Lineage

By coupling SQL Comprehension with DAG Comprehension, you could unlock:

  • a full and accurate understanding of the column dependencies in your data pipeline, informing debugging workflows and better development
  • even slimmer CI builds by only building downstream models if they depend on the exact columns that were changed
  • downstream impact analysis by knowing if changing a column in one model will break a downstream dashboard

What enables all of those experiences? Precise column-level lineage.

Column-level lineage traces the changes to your columns throughout your DAG to give practitioners a deeper view into how their data is being transformed. As an example, consider the column selections and modifications that lead to a last_activity_at column in fct_users.
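Here's a minimal sketch of such a chain (the staging model and source names are hypothetical):

-- stg_events.sql
select user_id, event_at
from {{ source('app', 'events') }}

-- fct_users.sql
select
    user_id,
    max(event_at) as last_activity_at
from {{ ref('stg_events') }}
group by user_id

Column-level lineage attributes fct_users.last_activity_at back to events.event_at, along with the max() aggregation applied on the way.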

Historically, the state of the art for column-level lineage has been derived from parsing (level 1 SQL comprehension, which is what we have in dbt Explorer today). Generating it reliably has been slow and has required a lot of additional work on the backend.

But remember that the logical plan already has information about every column and every type in your DAG, and can be generated in a snap. This makes it an even better input for CLL: using the logical plan results in completely accurate lineage of not just every column but even the properties of every struct, variant, or object inside a column.
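For example (illustrative Snowflake-style syntax and model names), the plan can attribute a derived column to a specific property inside a variant rather than to the variant column as a whole:

-- a parser alone sees only a reference to payload;
-- the logical plan knows plan_name comes from payload:account.plan
select
    payload:account.plan::varchar as plan_name
from {{ ref('raw_events') }}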

Beyond CLL: Information flow theory and metadata propagation

By leveraging principles from information flow theory, the compiler can track not just the physical or logical operations on columns, but also how data “flows” from one stage to another and how different transformations might affect its classification.

For instance, if a particular column is marked as containing sensitive information at its origin, the compiler can propagate that classification across subsequent transformations in the DAG.

This gives organizations a systematic way to maintain compliance, enforce security policies, and deeply understand data provenance. It becomes far easier to see exactly which downstream datasets or reports are derived from a given source column, and to ensure that sensitive information is handled appropriately across the entire data pipeline.
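Concretely (a hypothetical sketch), suppose the email column on a CRM source is classified as PII. The compiler can see that every column derived from it, even a transformed one, carries data that originated there:

-- stg_customers.sql
-- email is classified as PII at the source; the compiler knows both
-- columns below are derived from it and can propagate the classification
select
    email,
    split_part(email, '@', 2) as email_domain
from {{ source('crm', 'customers') }}

Whether email_domain keeps the full PII classification is a policy decision, but the flow itself is now known, so the policy can be enforced automatically.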

When we combine CLL with the metadata associated with a given column (name, description, whether or not it’s tested, etc.), we can propagate this metadata all the way from ingestion to consumption.

Propagation lets you put your metadata to use:

  • Naming propagation: faster development - rename a column in one place and have it updated in every SQL query that references it
  • Test propagation: save compute by removing duplicative tests - if you’ve already validated column_a is unique, don’t re-check for uniqueness unless your SQL query changes that column (see the sketch after this list)
  • Description propagation: DRY-er code and more complete documentation - define a column description once and it flows everywhere that column is referenced
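For instance (hypothetical models), if order_id is tested as unique in a staging model and a downstream model selects it without any joins or fan-out, re-testing uniqueness downstream adds cost without adding information:

-- stg_orders.sql: order_id is tested as unique here
select order_id, amount
from {{ source('shop', 'orders') }}

-- fct_orders.sql: order_id passes through untouched, so the uniqueness
-- guarantee propagates and a duplicate unique test can be skipped
select
    order_id,
    amount,
    amount * 0.1 as estimated_tax
from {{ ref('stg_orders') }}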

Optimize: Right query, right place

Logical plan analysis opens up several doors to optimize our workflow. Looking at this through the lens of a compiler, which has been disaggregated from the compute engine itself, we can explore two distinct approaches to query optimization. The first is to optimize the query for the engine of choice. The second is to optimize the choice of the engine.

Query optimization

All modern, widely adopted OLAP query engines come with impressive query optimizers. However, the expressiveness of SQL can be a burden here: code written with the best of intentions can accidentally undercut these optimizations. Instead of relying on organizational knowledge to avoid these pitfalls, a compiler can automatically warn about them and even rewrite queries into forms the optimizer handles well.

Let's look at an example in action - Query Pruning.

Query pruning ensures that your warehouse only scans the subset of data referred to in your query; if you're looking at just one day of data in a table, it doesn't need to scan the whole table. Through static and dynamic analysis of the query, your data platform can scan drastically smaller subsets of the data (e.g., Snowflake's micro-partitions) while still computing the same result.

Type conversions, however, can sometimes prevent your data platform from pruning your query. For example, take the following query:

select *  
from {{ ref('my_orders') }}  
where to_varchar(order_date, 'yyyy-mm-dd') = '1992-01-31' 

At first glance, this makes sense. Your date on the right-hand side is rendered as an ISO-8601 string, so you want to make sure the order_date values are formatted the same way. Unfortunately, this cast to varchar will be a major performance hit.

The direction of the cast in this query is backwards - by casting the entire column’s values to varchar, Snowflake can’t use its date-specific pruning capabilities. It’d be better to cast the '1992-01-31' string to a date; and, in fact, left to its own devices, Snowflake would implicitly do just that, meaning the following query is significantly more performant:

select * from {{ ref('my_orders') }} 
where order_date = '1992-01-31'

That's the type of query optimization you'd want your compiler to flag - helping you get out of the way and let the data platform optimizer do its thing.

Let’s look at another example with CTEs. Thanks to our friends at SELECT, we know that referencing a CTE more than once prevents another type of query pruning: column pruning.

Consider this query, where we select * from a model into an “import CTE” (american_sales). Referencing it in two different downstream CTEs will make the query take longer than necessary to run on large datasets:

with american_sales as (
    select * from {{ ref('sales') }}
    where region = 'AMERICA'
),

-- Calculate the total sales for all products
total_sales_america as (
    select 
        sum(amount) as total_sales
    from american_sales
),

-- Calculate the total sales for each product
product_sales_america as (
    select 
        product_id, 
        sum(amount) as product_total_sales
    from american_sales
    group by product_id
)

...

As you can see, total_sales_america only needs the amount column, and product_sales_america needs both amount and product_id.

If we only had the first CTE (total_sales_america), Snowflake would recognize that the * in the import CTE was overkill and only scan the amount column from the sales model. It would prune the rest of the columns from its table scan (that's how column pruning gets its name!).

But in this case, Snowflake will scan every column of the sales model, even though the two CTEs together only ever need product_id and amount.

Heeding the compiler’s warning, we can improve the query’s performance by specifying only the columns we need:

with american_sales as (
    select
        product_id,
        amount
    from {{ ref('sales') }}
    where region = 'AMERICA'
),

-- Calculate the total sales for all products
total_sales_america as (
    select 
        sum(amount) as total_sales
    from american_sales
),

-- Calculate the total sales for each product
product_sales_america as (
    select 
        product_id, 
        sum(amount) as product_total_sales
    from american_sales
    group by product_id
)

...

Static analysis of the logical plan opens up a plethora of opportunities to optimize a single query, but as we mentioned before, databases are already pretty good at this. Where it gets even better is when you can understand all queries at once, even if they're written for different engines.

Engine optimization

Five years ago it was rare to hear of a data team operating on multiple distinct data platforms. Today, not so much. This shift is in large part fueled by the adoption of open table formats like Iceberg, which unify the data storage format across warehouses and unlock an abundance of opportunities to build cross-platform data workflows. But today, these workflows are missing a crucial component: a multi-dialect compiler.

This compiler allows data teams to fine-tune their DAG execution. They can place their queries on a sliding scale between cost and performance based on their priorities and use case. This can even be done on a single engine, dynamically selecting the optimal warehouse size for a query based on its data load history and plan complexity.
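Today you can approximate the single-engine version of this by hand, for example by pinning a heavy dbt model to a larger Snowflake warehouse with the snowflake_warehouse config (the warehouse and model names below are illustrative). The promise of a plan-aware compiler is making that choice automatically from the query's history and plan complexity.

-- a heavy aggregation pinned to a larger warehouse by hand
{{ config(snowflake_warehouse='transforming_xl') }}

select
    customer_id,
    sum(amount) as lifetime_value
from {{ ref('fct_orders') }}
group by customer_id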

The average dbt user runs all their development queries, unit tests, and CI checks against the same massively scalable engine they use to run their DAGs in production. However, most of these workloads query relatively small datasets, especially when using the upcoming sample mode in dbt Core. Running them on a nimble single-node engine like Apache DataFusion would make them faster and more cost-efficient: on small, low-complexity datasets, these engines often perform better and remove network overhead.

Imagine introducing a new model and testing it on a speedy single-node engine - how would you guarantee it behaves the same when run in production? You could manually inspect potentially thousands of rows of output, or you could guarantee identical behavior by having an intelligent compiler produce, for your query, the same logical plan that would be created for the production engine.
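To make that concrete (an illustrative sketch, not the output of any particular tool): the same logical step, shifting order_date forward by seven days, renders differently per dialect, and the compiler's job is to prove that both renderings reduce to the same plan.

-- Snowflake rendering
select dateadd(day, 7, order_date) as due_date
from {{ ref('my_orders') }}

-- equivalent rendering for a Postgres-style single-node engine
select order_date + interval '7 days' as due_date
from {{ ref('my_orders') }}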

Without guaranteed conformance, you'd have to literally run all your queries on all engines to be certain you were getting the same data - not a realistic workflow, so in practice you're probably accepting poorer data quality instead. Multi-dialect validation is a requirement for any engine that will reliably power cross-platform data workflows.

Conclusion

A system that can analyze both the DAG of transformations and the logical plan of each transformation is more powerful than systems that can only understand one or the other. We are very excited about what compiler-driven experiences are going to power across the ecosystem.
