
How to Refactor Legacy SQL for Modularity | dbt Labs

May 24, 2022

Product

What is dbt Live?

The dbt Live: Expert Series consists of 60-minute interactive sessions where dbt Labs Solution Architects share practical advice on how to tackle common problems that they see over and over again in the field — like how to safely split environments with CI/CD, or how to implement blue/green deployments.

The event agenda is shaped by audience requests – so whether you’re new to dbt or looking to brush up on a specific topic, we’d love to see you there! Register to join the next live session.

Session Recap: Refactoring Legacy SQL

The EMEA-friendly edition of dbt Live featured two experts from dbt Labs:

  • Sean McIntyre, Senior Solutions Architect
  • Benoit Perigaud, Senior Analytics Engineer

Sean started things off by tackling a problem many folks find daunting: refactoring legacy SQL scripts into smaller components that can move into dbt. You can watch the full replay here!

Let's start with why

Why might you take the time to refactor and migrate?

  • Version control: Enable others – or “future you” – to edit or share data transformations without resorting to email.
  • Readability: Help others – new team members in particular – learn what the code is doing.
  • Modularity: Break monolithic transformations down into smaller components that can be tested and reused more easily.

Scenario & Overview

It’s your first day as an Analytics Engineer at Jaffle Global. Your boss wants to filter customer data by a new dimension, country. He emails you the script that generates the current customer table – hundreds of lines of code – and asks you to figure out how to add the new column. How do you start?

An overview of Sean's approach:

  1. Review the legacy script to understand its intent. Break it down into smaller stages of transformation.
  2. Move the pieces into dbt and relate them to each other.
  3. Define sources. (Bonus lesson: using packages to save time and effort!)
  4. Test and validate results.
  5. Commit the changes 🎉 and stick around for Q&A.

Step-by-Step Breakdown

Start by copying the entire query that needs to be refactored and pasting it into the text editor in the Cloud IDE for easy reference.

Sean uses dbt Cloud to push changes from his text editor to his connected code repo in GitHub and data warehouse in Snowflake. You can adapt his steps if you prefer to work locally with dbt Core, or use a different IDE or cloud data warehouse connected to dbt Cloud.

Build A Model for Paid Orders

Copy the SELECT statement for order data from the legacy query (shown below). Save it into a new file in your dbt project named paid_orders.sql.

select
    Orders.id as order_id,
    Orders.USER_ID	as customer_id,
    Orders.ORDER_DATE AS order_placed_at,
        Orders.STATUS AS order_status,
    p.total_amount_paid,
    p.payment_finalized_date,
    C.FIRST_NAME    as customer_first_name,
        C.LAST_NAME as customer_last_name
FROM raw.jaffleshop.orders as Orders
left join (
  select "orderID" as order_id, max(CREATED) as payment_finalized_date, sum(AMOUNT) / 100.0 as total_amount_paid
        from raw.stripe.payment
        group by 1) p ON orders.ID = p.order_id
left join raw.jaffleshop.customers C on orders.USER_ID = C.ID

-- incremental loading logic
where order_placed_at > (select coalesce(max(order_placed_at), to_timestamp('2010-01-01T00:00:00Z')) from sa_sean_m.special_analysis_1.paid_orders);

Comment out the incremental logic for now -- we'll come back to this later.

Run this query as-is in dbt, to make sure it returns results before you proceed with changes.

Convert the subquery to a CTE and rename the result set from p to payments, to make its meaning transparent.

WITH payments as
(select
order_id,
max(created) as payment_finalized_date,
sum(amount) / 100 as total_amount_paid
from raw.stripe.payment
Group by 1)
 
SELECT
orders.id as order_id,
orders.user_id as customer_id,
orders.order_date as order_placed_at,
orders.status as order_status,
payments.total_amount_paid,
payments.payment_finalized_date,
customers.first_name,
customers.last_name
FROM raw.jaffleshop.orders AS orders
LEFT JOIN payments 
    ON orders.id = payments.order_id
LEFT JOIN raw.jaffleshop.customers AS customers
    ON customers.id = orders.user_id

-- incremental loading logic
-- where order_placed_at > (select coalesce(max(order_placed_at), to_timestamp('2010-01-01T00:00:00Z')) from sa_sean_m.special_analysis_1.paid_orders);

Run the updated query to make sure it returns results.

Incremental Models

You may want to keep the incremental logic in the model, to optimize performance and limit the amount of data processed on routine runs.

To convert it to dbt’s syntax:

  • Wrap the WHERE clause in the is_incremental() macro.
  • Use the {{ this }} variable to refer to the model we're working in.

This revised block tells dbt to process only the latest orders on incremental runs:

{% if is_incremental() %}
 
-- this filter will only be applied on an incremental run

where order_placed_at > (select max(order_placed_at) from {{ this }})
 
{% endif %}

Add a config block to the top of the file, to tell dbt how to materialize results in the warehouse:
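{{
   config(
       materialized='incremental',
       unique_key='order_id'
   )
}}

(This is the same config block used in the final refactored model later in this post.)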

To change the model's schema and rebuild it from scratch, pass the --full-refresh flag to the run command:

dbt run --full-refresh --models paid_orders

This tells dbt to ignore the incremental logic on this initial run, compile to a CREATE OR REPLACE TABLE statement, and reprocess the entire model this time.

You can run future updates without the full-refresh argument. dbt will then use the incremental logic, process a smaller amount of data, and merge results into the base table created by the initial run.
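For routine incremental runs, simply drop the flag:

dbt run --models paid_orders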

Build A Model for Customers

Next, we'll refactor a stored procedure for transforming customer data, and relate it to the paid_orders model created in the last step.

Here’s the procedure we're starting with:

create or replace procedure update_customers()
returns text
language sql
as

$$

merge into sa_sean_m.special_analysis_1.customers c using (

    select customer_id
        , min(order_placed_at) as first_order_date
        , max(order_placed_at) as most_recent_order_date
        , count(order_id) AS number_of_orders
        , sum(total_amount_paid) as customer_lifetime_value
    from sa_sean_m.special_analysis_1.paid_orders as paid_orders
    group by 1

) u on c.customer_id = u.customer_id
when matched then update set first_order_date = u.first_order_date, most_recent_order_date = u.most_recent_order_date, number_of_orders = u.number_of_orders, customer_lifetime_value = u.customer_lifetime_value
when not matched then insert (customer_id, first_order_date, most_recent_order_date, number_of_orders, customer_lifetime_value) values (customer_id, first_order_date, most_recent_order_date, number_of_orders, customer_lifetime_value);

Paste the SELECT statement for customers into a new file and save it as customers.sql in your dbt project.

Replace the hard-coded table reference with a ref function pointing to the paid_orders model: {{ ref('paid_orders') }}
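At this point, customers.sql is essentially the SELECT from the stored procedure with the ref swapped in. A sketch of the intermediate file (the country column comes later, in the sources section):

select
customer_id,
min(order_placed_at) as first_order_date,
max(order_placed_at) as most_recent_order_date,
count(order_id) as number_of_orders,
sum(total_amount_paid) as customer_lifetime_value
from {{ ref('paid_orders') }}
group by 1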

Under the hood, dbt uses the ref function to:

  • Run models in your project in the correct order.
  • Write results to the appropriate (prod or dev) environment.
  • Generate a lineage graph that maps dependencies between your models.
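To make that concrete: a ref compiles to a fully qualified table name that depends on the target environment. In a development target, the database and schema below are purely illustrative:

from {{ ref('paid_orders') }}

-- compiles (in a dev target) to something like:

from analytics.dbt_sean.paid_orders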

We'll share the final, refactored model in the section below.

Add Packages and Sources

For the next steps in the demo, Sean suggests using two packages – libraries of code that help automate or simplify common routines.

Define Sources

First, let's use the codegen package as a shortcut for setting up sources.

Declaring sources in dbt enables you to:

  • Identify models that depend on a given data source in the lineage graph: This can be useful when scoping the impact of a change in that source.
  • Check source “freshness”: This takes a bit of extra configuration, but can help you define SLAs and monitor the health of your data pipelines.
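As a sketch of that extra configuration, a freshness block on a source might look like the YAML below (the loaded_at_field column name is an assumption for illustration):

sources:
  - name: jaffleshop
    database: raw
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders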

The codegen package can make source setup a bit easier, by generating valid YAML syntax and saving you the trouble of writing it from memory by hand.

Steps:

  • Install codegen from the dbt Hub.
  • Add the package name and version to packages.yml in your dbt project (a sketch follows this list), then run dbt deps to install the package in your environment.
  • Paste the macro below into the Cloud IDE and compile:
{{ codegen.generate_source('jaffleshop', database_name='') }}
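For reference, the packages.yml entry from the step above might look like this (the version shown is illustrative; check the dbt Hub for the latest):

packages:
  - package: dbt-labs/codegen
    version: 0.7.0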

Compiling the macro generates valid YAML for the sources in your schema. Copy the output into a new file, and save it as sources.yml in your dbt project.
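The generated output, trimmed here to the tables used in this demo, looks something like:

version: 2

sources:
  - name: jaffleshop
    database: raw
    tables:
      - name: orders
      - name: customers
      - name: customers_with_country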

With sources declared, return to the two models you've created, and:

  • Replace hard-coded table names with the source() function for each source.
  • Pull the new country column requested by your boss into each model.

The refactored paid_orders model:

{{
   config(
       materialized='incremental',
       unique_key='order_id'
   )
}}

WITH payments as
(select
order_id,
max(created) as payment_finalized_date,
sum(amount) / 100 as total_amount_paid
from raw.stripe.payment
Group by 1)
 
SELECT
orders.id as order_id,
orders.user_id as customer_id,
orders.order_date as order_placed_at,
orders.status as order_status,
payments.total_amount_paid,
payments.payment_finalized_date,
customers.first_name,
customers.last_name,
customers.country
FROM {{ source('jaffleshop', 'orders') }} AS orders
LEFT JOIN payments
    ON orders.id = payments.order_id
LEFT JOIN {{ source('jaffleshop', 'customers_with_country') }} AS customers
    ON customers.id = orders.user_id

{% if is_incremental() %}
 
-- this filter will only be applied on an incremental run
where order_placed_at > (select max(order_placed_at) from {{ this }})
 
{% endif %}

The refactored customers model:

{{
    config(
        materialized='table'
    )
}}

select
customer_id,
country,
min(order_placed_at) as first_order_date,
max(order_placed_at) as most_recent_order_date,
count(order_id) as number_of_orders,
sum(total_amount_paid) as customer_lifetime_value
from {{ ref('paid_orders') }}
group by 1, 2
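With both models in place, a single command builds them in dependency order, since the ref in customers tells dbt to run paid_orders first. The + prefix below selects customers plus everything it depends on:

dbt run --models +customers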

Validation

Finally, compare the run results of the new, refactored customers model to the customer data generated by the old query. Sean suggests the audit_helper package to simplify this routine and auto-generate a SQL query for this comparison.

Steps:

  • Install audit_helper from dbt Hub:
    • Add the name and version to packages.yml in your dbt project.
    • Run dbt deps to install the package in your environment.
  • Copy the compare_relations macro provided on the package page, and paste it into a new Developer tab in dbt Cloud.
  • Modify the macro for the test circumstances (see the sketch after this list):
    • Replace the a_relation value with the name of the old customer table (created by the legacy query).
    • Replace the b_relation value with the name of the new table (created by the refactored dbt models).
    • Replace the primary_key value with the key shared by both tables – in this case, customer_id.
  • Click Preview. This will compile the macro to a SQL query that compares the two tables named above, run it under the hood, and return the results to your screen.
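A sketch of the modified macro, reusing the schema from the legacy scripts above for the old table (the relation names are illustrative):

{% set old_customers = adapter.get_relation(
    database="sa_sean_m",
    schema="special_analysis_1",
    identifier="customers"
) %}

{{ audit_helper.compare_relations(
    a_relation=old_customers,
    b_relation=ref('customers'),
    primary_key="customer_id"
) }}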

You can check out Sean's validation results (at 25:42) in the replay of the demo.

If you'd like to dive deeper into this topic, we also offer a (free!) self-paced course on Refactoring for Modularity with more examples and instruction.

Participant Questions

After the demo, Sean and Benoit answered a range of community member questions live.

To hear all of the Q&A, replay the video (starting around 32:00) and visit the #events-dbt-live-expert-series Slack channel to see topics raised in the chat.

A sample of questions:

  • (32:47) - When will Python be supported in dbt?
  • (36:00) - How can I help my team move towards an analytics engineering mindset?
  • (39:46) - What's the easiest way to get teammates set up on dbt CLI/Core?
  • (42:30) - How can I decompose a very large model without making it less readable?
  • (45:04) - Why does dbt recommend CTEs? Do CTEs hurt query performance?
  • (46:45) - How should I model my data, to optimize queries in downstream BI tools?

Enjoyed this demo and ready for more?

We have additional sessions lined up in the coming weeks – register to hear more from the dbt Labs Solutions Architects team.

Last modified on: Oct 15, 2024
