How to Refactor Legacy SQL for Modularity | dbt Labs
May 24, 2022
What is dbt Live?
The dbt Live: Expert Series consists of 60-minute interactive sessions where dbt Labs Solution Architects share practical advice on how to tackle common problems that they see over and over again in the field — like how to safely split environments with CI/CD, or how to implement blue/green deployments.
The event agenda is shaped by audience requests – so whether you’re new to dbt or looking to brush up on a specific topic, we’d love to see you there! Register to join the next live session.
Session Recap: Refactoring Legacy SQL
The EMEA-friendly edition of dbt Live featured two experts from dbt Labs:
- Sean McIntyre, Senior Solutions Architect
- Benoit Perigaud, Senior Analytics Engineer
Sean started things off by tackling a problem many folks find daunting: refactoring legacy SQL scripts into smaller components that can move into dbt. You can watch the full replay here!
Let's start with why
Why might you take the time to refactor and migrate?
- Version control: Enable others – or “future you” – to edit or share data transformations without resorting to email.
- Readability: Help others – new team members in particular – learn what the code is doing.
- Modularity: Break monolithic transformations down into smaller components that can be tested and reused more easily.
Scenario & Overview
It’s your first day as an Analytics Engineer at Jaffle Global. Your boss wants to filter customer data by a new dimension, country. He emails you the script that generates the current customer table – hundreds of lines of code – and asks you to figure out how to add the new column. How do you start?
An overview of Sean's approach:
- Review the legacy script to understand its intent. Break it down into smaller stages of transformation.
- Move the pieces into dbt and relate them to each other.
- Build paid_orders model (Bonus lesson: incremental models!)
- Build customers model, and relate it to paid_orders.
- Define sources. (Bonus lesson: using packages to save time and effort!)
- Test and validate results.
- Commit the changes 🎉 and stick around for Q&A.
Step-by-Step Breakdown
Start by copying the entire query that needs to be refactored and pasting it into the text editor in the Cloud IDE for easy reference.
Sean uses dbt Cloud to push changes from his text editor to his connected code repo in GitHub and data warehouse in Snowflake. You can adapt his steps if you prefer to work locally with dbt Core, or use a different IDE or cloud data warehouse connected to dbt Cloud.
Build A Model for Paid Orders
Copy the SELECT statement for order data from the legacy query above. Save it into a new file in your dbt project named paid_orders.sql.
select
Orders.id as order_id,
Orders.USER_ID as customer_id,
Orders.ORDER_DATE AS order_placed_at,
Orders.STATUS AS order_status,
p.total_amount_paid,
p.payment_finalized_date,
C.FIRST_NAME as customer_first_name,
C.LAST_NAME as customer_last_name
FROM raw.jaffleshop.orders as Orders
left join (
select "orderID" as order_id, max(CREATED) as payment_finalized_date, sum(AMOUNT) / 100.0 as total_amount_paid
from raw.stripe.payment
group by 1) p ON orders.ID = p.order_id
left join raw.jaffleshop.customers C on orders.USER_ID = C.ID
-- incremental loading logic
where order_placed_at > (select coalesce(max(order_placed_at), to_timestamp('2010-01-01T00:00:00Z')) from sa_sean_m.special_analysis_1.paid_orders);
Comment out the incremental logic for now -- we'll come back to this later.
Run this query as-is in dbt, to make sure it returns results before you proceed with changes.
Convert the subquery to a CTE and rename the result set from p to payments, to make its meaning clear.
WITH payments as (
    select
        "orderID" as order_id,
        max(created) as payment_finalized_date,
        sum(amount) / 100.0 as total_amount_paid
    from raw.stripe.payment
    group by 1
)

SELECT
    orders.id as order_id,
    orders.user_id as customer_id,
    orders.order_date as order_placed_at,
    orders.status as order_status,
    payments.total_amount_paid,
    payments.payment_finalized_date,
    customers.first_name as customer_first_name,
    customers.last_name as customer_last_name
FROM raw.jaffleshop.orders AS orders
LEFT JOIN payments
    ON orders.id = payments.order_id
LEFT JOIN raw.jaffleshop.customers AS customers
    ON customers.id = orders.user_id
-- incremental loading logic
-- where order_placed_at > (select coalesce(max(order_placed_at), to_timestamp('2010-01-01T00:00:00Z')) from sa_sean_m.special_analysis_1.paid_orders);
Run the updated query to make sure it returns results.
Incremental Models
You may want to keep the incremental logic in the model, to optimize performance and limit the amount of data processed on routine runs.
To convert it to dbt’s syntax:
- Wrap the WHERE clause in an is_incremental() macro.
- Use the {{ this }} variable to refer to the model we're working in.
This revised block tells dbt to process only the latest orders on incremental runs:
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where order_placed_at > (select max(order_placed_at) from {{ this }})
{% endif %}
Add a config block to the top of the file, to tell dbt how to materialize results in the warehouse:
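{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}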
To rebuild the model from scratch (for example, after changing its schema), pass a full-refresh argument in the run command:
dbt run --full-refresh --models paid_orders
This tells dbt to ignore the incremental logic on this initial run, compile to a CREATE OR REPLACE TABLE statement, and reprocess the entire model this time.
You can run future updates without the full-refresh argument. dbt will then use the incremental logic, process a smaller amount of data, and merge the results into the base table created by the initial run.
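For routine runs after that first build, the same command without the flag is all you need; dbt will apply the is_incremental() filter automatically:

dbt run --models paid_orders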
Build A Model for Customers
Next, we'll refactor a stored procedure for transforming customer data, and relate it to the paid_orders model created in the last step.
Here’s the procedure we're starting with:
create or replace procedure update_customers()
returns text
language sql
as
$$
merge into sa_sean_m.special_analysis_1.customers c using (
select customer_id
, min(order_placed_at) as first_order_date
, max(order_placed_at) as most_recent_order_date
, count(order_id) AS number_of_orders
, sum(total_amount_paid) as customer_lifetime_value
from sa_sean_m.special_analysis_1.paid_orders as paid_orders
group by 1
) u on c.customer_id = u.customer_id
when matched then update set first_order_date = u.first_order_date, most_recent_order_date = u.most_recent_order_date, number_of_orders = u.number_of_orders, customer_lifetime_value = u.customer_lifetime_value
when not matched then insert (customer_id, first_order_date, most_recent_order_date, number_of_orders, customer_lifetime_value) values (u.customer_id, u.first_order_date, u.most_recent_order_date, u.number_of_orders, u.customer_lifetime_value);
$$;
Paste the SELECT statement for customers into a new file and save it as customers.sql in your dbt project.
Replace the hard-coded table reference with a ref function pointing to the paid_orders model: {{ ref('paid_orders') }}
Under the hood, dbt uses the ref function to:
- Run models in your project in the correct order.
- Write results to the appropriate (prod or dev) environment.
- Generate a lineage graph that maps dependencies between your models.
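As a rough illustration, the ref call is swapped out for a fully qualified table name at compile time. The database and schema names below are made up; the actual values depend on your connection and target settings.

-- in models/customers.sql
select ... from {{ ref('paid_orders') }}

-- compiles to something like this against a dev target:
select ... from analytics.dbt_sean.paid_orders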
We'll share the final, refactored model in the section below.
Add Packages and Sources
For the next steps in the demo, Sean suggests using two packages – libraries of code that help automate or simplify common routines.
Define Sources
First, let's use the codegen package as a shortcut for setting up sources.
Declaring sources in dbt enables you to:
- Identify models that depend on a given data source in the lineage graph: This can be useful when scoping the impact of a change in that source.
- Check source “freshness”: This takes a bit of extra configuration, but can help you define SLAs and monitor the health of your data pipelines.
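That extra configuration lives alongside the source definition in YAML. A minimal sketch is below; the loaded_at_field column (_etl_loaded_at) and the thresholds are illustrative, not part of the demo.

sources:
  - name: jaffleshop
    database: raw
    loaded_at_field: _etl_loaded_at  # illustrative column name
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders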
The codegen package can make source setup a bit easier by generating valid YAML syntax, saving you the trouble of writing it from memory by hand.
Steps:
- Install codegen from the dbt Hub: add the package name and version to packages.yml in your dbt project, then run dbt deps to install the package in your environment.
- Paste the macro below into the Cloud IDE and compile:
{{ codegen.generate_source('jaffleshop', database_name='') }}
This will generate valid YAML syntax for the sources in your schema. Copy and paste this into a new file, and save it as sources.yml in your dbt project.
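For the Jaffle Shop data used in the demo, the generated output will look roughly like this (the exact table list depends on what's in your raw schema):

version: 2

sources:
  - name: jaffleshop
    database: raw
    tables:
      - name: orders
      - name: customers_with_country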
With sources declared, return to the two models you've created, and:
- Replace hard-coded table names with source() references.
- Pull the new country column requested by your boss into each model.
The refactored paid_orders model:
{{
config(
materialized='incremental',
unique_key='order_id'
)
}}
WITH payments as (
    select
        "orderID" as order_id,
        max(created) as payment_finalized_date,
        sum(amount) / 100.0 as total_amount_paid
    from raw.stripe.payment
    group by 1
)

SELECT
    orders.id as order_id,
    orders.user_id as customer_id,
    orders.order_date as order_placed_at,
    orders.status as order_status,
    payments.total_amount_paid,
    payments.payment_finalized_date,
    customers.first_name as customer_first_name,
    customers.last_name as customer_last_name,
    customers.country
FROM {{ source('jaffleshop', 'orders') }} AS orders
LEFT JOIN payments
    ON orders.id = payments.order_id
LEFT JOIN {{ source('jaffleshop', 'customers_with_country') }} AS customers
    ON customers.id = orders.user_id
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where order_placed_at > (select max(order_placed_at) from {{ this }})
{% endif %}
The refactored customers model:
{{
config(
materialized='table'
)
}}
select
customer_id,
country,
min(order_placed_at) as first_order_date,
max(order_placed_at) as most_recent_order_date,
count(order_id) as number_of_orders,
sum(total_amount_paid) as customer_lifetime_value
from {{ ref('paid_orders') }}
group by 1, 2
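With both models saved, a single run builds them in dependency order, because the ref call tells dbt that customers depends on paid_orders. The selection syntax below is standard dbt; +customers means customers plus everything upstream of it:

dbt run --models +customers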
Validation
Finally, compare the run results of the new, refactored customers model to the customer data generated by the old query. Sean suggests the audit_helper package to simplify this routine and auto-generate a SQL query for the comparison.
Steps:
- Install audit_helper from the dbt Hub: add the package name and version to packages.yml in your dbt project, then run dbt deps to install the package in your environment.
- Copy the compare_relations macro provided on the package page, and paste it into a new Developer tab in dbt Cloud.
- Modify the macro for the test circumstances (see the sketch after this list):
  - Replace the a_relation value with the name of the old customer table (created by the legacy query).
  - Replace the b_relation value with the name of the new table (created by the new, refactored dbt models).
  - Replace the primary_key value with the key shared by both tables – in this case, customer_id.
- Click Preview. This will compile the macro to a SQL query that compares the two tables named above, run it under the hood, and return the results to your screen.
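A sketch of what the modified macro might look like, assuming the legacy table is the one targeted by the stored procedure above (sa_sean_m.special_analysis_1.customers); adjust the database, schema, and identifier to match your own warehouse:

{% set old_relation = adapter.get_relation(
      database="sa_sean_m",
      schema="special_analysis_1",
      identifier="customers"
) -%}

{% set dbt_relation = ref('customers') %}

{{ audit_helper.compare_relations(
    a_relation=old_relation,
    b_relation=dbt_relation,
    primary_key="customer_id"
) }}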
You can check out Sean's validation results (at 25:42) in the replay of the demo.
If you'd like to dive deeper into this topic, we also offer a (free!) self-paced course on Refactoring for Modularity with more examples and instruction.
Participant Questions
After the demo, Sean and Benoit answered a range of community member questions live.
To hear all of the Q&A, replay the video (starting around 32:00) and visit the #events-dbt-live-expert-series Slack channel to see topics raised in the chat.
A sample of questions:
- (32:47) - When will Python be supported in dbt?
- (36:00) - How can I help my team move towards an analytics engineering mindset?
- (39:46) - What's the easiest way to get teammates set up on dbt CLI/Core?
- (42:30) - How can I decompose a very large model without making it less readable?
- (45:04) - Why does dbt recommend CTEs? Do CTEs hurt query performance?
- (46:45) - How should I model my data, to optimize queries in downstream BI tools?
Enjoyed this demo and ready for more?
We have additional sessions lined up in the coming weeks – register to hear more from the dbt Labs Solutions Architects team.