How to build scalable data pipelines with Snowflake and dbt
The demand for high-quality data is accelerating. That’s especially true with the emergence of AI and machine learning use cases. This means analytics engineers need tools that enable them to create, test, document, and deploy data pipelines faster than ever before.
Over the past several years, dbt Labs and Snowflake have both spent a lot of person-power figuring out how to remove friction from the data pipeline development experience. Using the two systems together, analytics engineers can bring new data products from conception to reality quickly—without sacrificing data quality.
I’ll dive into the benefits of using dbt together with Snowflake, then show the two in action by implementing a data pipeline that analyzes trading profit and loss. I’ll also dig into some advanced use cases, such as leveraging a Large Language Model (LLM) to extract patterns and sentiments from trader execution notes, and discuss other ways that dbt and Snowflake are leveraging AI to accelerate data pipeline production.
Why dbt + Snowflake for data pipelines?
dbt is the industry standard for data transformation at scale. It’s a framework that sits on top of your data warehouse and lets anyone who can write SQL deploy production-grade pipelines on Snowflake.
One thing that makes dbt such a strong framework is that it implements DevOps-style software engineering best practices in the data world. This means that dbt brings ideas like version control and modularity to your data engineering pipelines.
Snowflake is a powerful AI Data Cloud that allows you to build data-intensive applications without the operational overhead. Its unique architecture and years of innovation make it the best platform for mobilizing data in your organization.
Snowflake has also released a number of features that support a DevOps (or “DataOps”) approach to data pipelines. These include source control repository integration via Git, building Snowflake data primitives from your Git repo code, and running scripts and deploying complex objects—data models, Machine Learning (ML) models, native apps, etc.—with the Snowflake CLI. You can also use the Snowflake API from your Python scripts to work with Snowflake in a more Pythonic way.
With dbt plus DevOps support from Snowflake, you can now mix and match features to leverage advanced change management and build flexible Continuous Integration/Continuous Deployment (CI/CD) pipelines. This enables you to elevate your data engineering practices to the next level without resorting to third-party tools such as Terraform.
dbt and Snowflake work well together here in part because both take a declarative approach to DevOps for data. Together, they remove much of the lower-level complexity involved in working with data. That automation increases the number of people who can meaningfully transform data, leading to greater data democratization.
Building your dbt-Snowflake pipeline
Here’s how these features come together to make it easier to build data pipelines. In this walkthrough, you’ll use Snowflake and dbt to analyze trading profit and loss (P&L).

First, you’ll calculate P&L by normalizing trade data across multiple currencies using current foreign exchange (FX) rates, then compare actual performance against the portfolio targets. To do this, you’ll leverage datasets from Snowflake Marketplace on FX rates and US equity price history. You’ll blend these with a couple of manual sources coming from CSV files: the trading desk’s history and the target allocation ratios you’ll use as configuration.
Additionally, you’ll leverage the power of an LLM to extract patterns and sentiment from trader execution notes, giving analysts more insight into the decision-making behind their activity.
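To make the currency-normalization step concrete, here’s a minimal sketch of the kind of join involved; the table and column names (trades, fx_rates, usd_rate, and so on) are illustrative placeholders rather than the project’s actual models:
-- Illustrative sketch only: table and column names are assumptions, not the project's real objects.
select
    t.trade_id,
    t.trade_date,
    t.quantity * t.price               as notional_local,
    t.quantity * t.price * fx.usd_rate as notional_usd
from trades t
left join fx_rates fx
    on fx.from_currency = t.currency_code
    and fx.rate_date = t.trade_date;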
This article gives a lot of technical background on each step. It’ll also discuss options for various steps in case your scenario differs slightly. If you want to focus on just setting up and running the code, you can use the step-by-step guide here. You can also see a detailed walkthrough of this demo in our webinar replay.
Prerequisites
- A Snowflake account, plus a user with the ACCOUNTADMIN role
- A GitHub account into which you can import (fork) the example code. You can join GitHub for free if you don’t have an account you can use for this walkthrough.
Access data
To get started, import the two data products from Snowflake Marketplace: the US Stock Prices table and the Currency Exchange Rates data.

Snowflake's unique value proposition started with a completely new way of sharing and collaborating both within and between organizations. In this scenario, you’re going to leverage this capability to connect to two free public datasets available to all Snowflake users.
This data is immediately available in your account as a source. There's no ETL. There’s zero latency and no additional storage cost, because the data still sits on the provider side. You’ve just become a consumer of it.
After this, you’ll have two datasets available as if they were part of your system: the currency exchange rates and the US stock prices data.
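Once the shares are available, you can query them like any other table. A quick sanity check might look like the following; the database, schema, table, and column names are placeholders that depend on what you named the Marketplace databases in your account:
-- Names below are placeholders; substitute the database, schema, and column names in your account.
select ticker, date, close
from us_stock_prices.public.stock_price_history
where ticker = 'AAPL'
order by date desc
limit 10;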
Next, you run a script from Snowflake to connect to GitHub and set up your development environment. Once that's done, your Snowflake environment will have a pointer to your forked Git repository. If you click on it, you can see exactly the same files you have in GitHub.
The deploy_environment.sql file in the GitHub repo has two notable attributes:

First, it has curly braces that belong to the Jinja templating language. This is exactly the same syntax you see everywhere in dbt, which is also Jinja template-based. In this case, the script templatizes environments with the {{env}} variable so you can re-run it to create your development, staging, testing, or production environments.
Second, there are CREATE OR ALTER commands: DDL statements that create tables and views if they don’t exist, or alter them to match the new definition if they do. With this script, you don’t need to care about the state of your environment. You can rerun it repeatedly and recreate environments for dev, staging, or production that have everything you need. You can set up your environment right from your Git repo by using the EXECUTE IMMEDIATE FROM command, which can execute a file written with Jinja:
USE ROLE ACCOUNTADMIN;
USE DATABASE SANDBOX;
ALTER GIT REPOSITORY DEMO_GIT_REPO FETCH;
EXECUTE IMMEDIATE FROM @DEMO_GIT_REPO/branches/main/scripts/deploy_environment.sql USING (env => 'DEV');
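For context, the statements inside deploy_environment.sql follow a pattern roughly like the one below. This is a sketch rather than the repo’s exact contents, and the schema, table, and column names are illustrative:
-- Sketch of the Jinja-templated CREATE OR ALTER pattern; object and column names are assumptions.
CREATE SCHEMA IF NOT EXISTS SANDBOX.STAGING_{{env}};

CREATE OR ALTER TABLE SANDBOX.STAGING_{{env}}.TRADING_BOOKS (
    trade_id    NUMBER,
    trade_date  DATE,
    trader_name STRING,
    ticker      STRING,
    quantity    NUMBER,
    price       NUMBER(18, 4),
    notes       STRING
);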
Transform data with dbt
Now, let’s see how things work on the dbt side. First, use dbt to initialize your empty environment with the dbt seed command. This will upload your CSV files containing your trading desk history and target allocation ratio data.
cd dbt_project
dbt seed
Next, run your dbt models, which specify how data will be transformed from the source and stored in the target tables. These can be different data systems; in this case, you’ll use Snowflake as both the source and the target.
The project divides its models into three logical groupings—staging, intermediate, and marts—which is one effective way to model your data for different purposes by segmenting them into different schemas.

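To give a feel for the staging layer, a staging model is typically just a light cleanup pass over a source or seed. The sketch below assumes a seed named raw_trading_books; the real project’s model may differ, although the columns match the ones used later in the pipeline:
-- models/staging/stg_trading_books.sql (sketch; the seed name raw_trading_books is an assumption)
with source as (
    select * from {{ ref('raw_trading_books') }}
)

select
    trade_id,
    trade_date,
    trader_name,
    desk,
    ticker,
    quantity,
    price,
    trade_type,
    notes
from source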
I won’t go too deep into the dbt models here. Suffice it to say, I generated the bulk of this code using dbt Copilot, which leverages LLMs and the information stored in a project to generate code from natural-language descriptions. I’m excited about how well this works and the boost to developer productivity it brings.
You can run these models with the dbt run command. dbt provides flexibility in how the new data is represented, or materialized, in Snowflake. Initially, the models are configured to materialize as views, which speeds up development and testing. For production, you’d likely use table materialization for better query performance.
You can change the materializations at any time—and set them differently in different environments—to suit your use case and optimize your pipelines. For example, if you’re materializing as tables (which makes querying in production more efficient), every run of a model copies all of its data over again. To be more efficient, you can switch a given model from table to incremental materialization, which only processes the rows you tell dbt to filter for: for example, only the rows created or modified after a specific date, as in the sketch below.
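An incremental model adds a config block plus a filter that only applies on incremental runs. The unique key and date column here are borrowed from the trading data, but treat the details as illustrative:
-- Sketch of an incremental materialization; the exact key and filter are illustrative.
{{ config(materialized='incremental', unique_key='trade_id') }}

select *
from {{ ref('stg_trading_books') }}

{% if is_incremental() %}
-- On incremental runs, only process rows newer than what's already in the target table.
where trade_date > (select max(trade_date) from {{ this }})
{% endif %}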
Scale with Snowflake DevOps features
Finally, you can leverage a combination of Snowflake and dbt features to deploy your pipeline to production. Here, you use the same approach you used to set up your dev environment by leveraging the EXECUTE IMMEDIATE FROM command but with an environment value of prod supplied as an argument:
USE ROLE ACCOUNTADMIN;
USE DATABASE SANDBOX;
ALTER GIT REPOSITORY DEMO_GIT_REPO FETCH;
EXECUTE IMMEDIATE FROM @DEMO_GIT_REPO/branches/main/scripts/deploy_environment.sql USING (env => 'PROD');
Then you can use dbt to deploy the seed data and run your data transformations:
dbt seed --target=prod
dbt run --target=prod
You could have used Snowflake stored procedures for all of this work. However, in my view, dbt represents a huge productivity boost in a number of ways:
- Because dbt uses a declarative rather than an imperative model, you spend your time describing the data you want instead of scripting the workflows that produce it.
- The dbt logic is also portable. Since all of your data transformations are code, you can put them under version control, where they can be change-managed and other analytics engineers can find them. Those engineers can easily run the same code on other Snowflake instances, reusing it to recreate this pipeline or build more sophisticated ones.
- dbt’s support for testing improves data quality, while its support for shipping rich documentation with its models makes it easier for stakeholders to understand and use the resulting data.
There will still, however, be custom components specific to Snowflake—tasks, streams, stored procedures, etc.—which might not be supported directly in dbt. In the past, you’d have to set these up with something like Terraform or DIY your own solution. New Snowflake DevOps features—such as the Snowflake CLI and the Python API—mean you can manage your data transformation models and all Snowflake objects via automated CI/CD workflows.
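As a rough sketch of what such a workflow could run, a CI job might combine the Snowflake CLI with dbt, reusing the deployment script shown earlier. This assumes the Snowflake CLI (snow) is installed and configured with a named connection called ci that already points at the SANDBOX database; verify the exact flags against your CLI version:
# Sketch of a CI job; the connection name and flags are assumptions to check for your setup.
snow sql -c ci -q "ALTER GIT REPOSITORY DEMO_GIT_REPO FETCH"
snow sql -c ci -q "EXECUTE IMMEDIATE FROM @DEMO_GIT_REPO/branches/main/scripts/deploy_environment.sql USING (env => 'PROD')"
dbt seed --target=prod
dbt run --target=prod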
Optimize for performance and cost
Using dbt and Snowflake features, you can optimize this even further. For example, you can leverage Dynamic Tables in Snowflake to deploy your intermediate and mart models. With Dynamic Tables, Snowflake takes care of creating and refreshing the tables, so you don’t need to build a separate pipeline to orchestrate them.
Using Dynamic Tables is as simple as turning on the relevant option for the relevant model group in your dbt_project.yml:

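As a rough sketch, the relevant section of dbt_project.yml might look something like the following; the project and folder names are illustrative, and the warehouse and lag settings depend on your dbt-snowflake version and latency requirements:
# Sketch only: project name, warehouse, and target_lag values are placeholders.
models:
  trading_pnl:
    intermediate:
      +materialized: dynamic_table
      +snowflake_warehouse: transforming_wh
      +target_lag: '5 minutes'
    marts:
      +materialized: dynamic_table
      +snowflake_warehouse: transforming_wh
      +target_lag: downstream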
Note that if you’ve defined data quality tests in dbt, these will still be run even when using Dynamic Tables. So you get the benefit of fast, low-overhead creation of data while also checking for and issuing alerts on any data quality issues with every run.
Another great advantage of using dbt with Snowflake from a cost-optimization standpoint is that you can dynamically control what size Snowflake warehouse you use for your runs, at a granular level. You can, for example, size a single model up from a small warehouse (the default) to a medium for that model’s run, and then immediately size it back down.
You can control this for each model, for a subset of models, and even based on the environment. I’ve seen customers resize dynamically based on how much data was in the input. It’s a powerful feature that grants customers a lot of flexibility.
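As a minimal sketch of that per-model control using dbt-snowflake’s snowflake_warehouse config, you can set the warehouse in a model’s config block and even switch it based on the target environment; the warehouse names and model reference here are illustrative:
-- Sketch: run just this model on a larger warehouse, chosen by target environment.
-- Warehouse names are placeholders.
{{ config(
    snowflake_warehouse='transforming_wh_m' if target.name == 'prod' else 'transforming_wh_s'
) }}

select *
from {{ ref('int_extracted_entities') }}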
The future of data pipelines: AI and automation
You can take your dbt and Snowflake pipelines even further by leveraging new functionality from each platform. For example, the intermediate/int_extracted_entities.sql model uses Snowflake Cortex LLM functions to analyze the trader’s notes and detect what market signal is driving the trade. It also classifies each note based on the execution strategy the trader seems to be taking, such as taking profit or cutting losses.
with trading_books as (
    select * from {{ ref('stg_trading_books') }}
),

-- Extract sentiment using SNOWFLAKE.CORTEX.SENTIMENT
cst as (
    select
        trade_id,
        trade_date,
        trader_name,
        desk,
        ticker,
        quantity,
        price,
        trade_type,
        notes,
        SNOWFLAKE.CORTEX.SENTIMENT(notes) as sentiment,
        SNOWFLAKE.CORTEX.EXTRACT_ANSWER(notes, 'What is the signal driving the following trade?') as signal,
        SNOWFLAKE.CORTEX.CLASSIFY_TEXT(notes||': '|| signal[0]:"answer"::string,['Market Signal','Execution Strategy']):"label"::string as trade_driver
    from trading_books
    where notes is not null
)

select * from cst
On the dbt side, dbt Labs recently acquired SDF, a high-performance toolchain that acts as a multi-dialect SQL compiler, linter, and language server, among other things. SDF can faithfully emulate cloud data warehouses such as Snowflake, catching breaking changes during development before analytics engineers check in a single line of code. That speed means SDF can catch errors as you type, well before you ever run dbt run.
Finally, I already mentioned how I leveraged dbt Copilot to generate most of my models. dbt Copilot integrates with the dbt Cloud Integrated Development Environment (IDE) to assist in the generation of code, documentation, tests, metrics, and semantic models. This greatly reduces the time spent writing models, accelerating the deployment of new data pipelines.
Bringing AI-powered pipelines to life with dbt + Snowflake
dbt and Snowflake both implement a number of declarative features that unlock faster and more efficient end-to-end data workflows. Using the two together, teams can initialize, model, test, document, and deploy data pipelines powered by the latest advances in AI faster than ever.
See it for yourself. Sign up for a free Snowflake account and a free dbt Cloud account, then try it out by following our walkthrough.