Automating Ambiguity: Managing dynamic source data using dbt macros
Oct 29, 2021
We’ve been huge fans of dbt since overhauling our data stack two years ago. As product development cycles increased in velocity, though, I began to notice that changing data schemas were causing huge amounts of work to update downstream reporting for execs and other teams.
Behavioral data is never static, but it is non-trivial to build a system that requires little or no maintenance for data engineers and analysts. In this talk, I’ll tell the story about our stack, our changing schemas and how I used dbt to make life easier for both the engineers instrumenting tracking as well as the executives consuming downstream reporting.
Follow along in the slides here.
Full transcript #
[00:00:00] Amada Echeverría: Welcome everyone. And thank you for joining us at Coalesce 2021. My name is Amada Echeverría. I use she/her pronouns and I’m a developer relations advocate in the community team at dbt Labs. I’m thrilled to be hosting today’s session, Automating Ambiguity: Managing Dynamic Source Data Using dbt macros presented by Eric Nelson.
Eric Nelson is a senior analytics engineering consultant at Big Time Data. His previous experience includes analyst and data engineering work at Salesforce Heroku, and Mattermost. Eric has been a huge fan of dbt since overhauling Mattermost’s data stack two years ago, and continues to use dbt in his current role when he’s not busy developing production-ready analytics infrastructure.
You can find him playing golf or running on his hamster wheel, er, treadmill. In this 30-minute session, Eric will tell the story of Mattermost’s data stack, its changing schemas, and how he used [00:01:00] dbt to make life easier for both the engineers instrumenting tracking and the executives consuming downstream reporting.
According to Eric, this talk could have easily been titled How to Keep Engineering and Executives Happy with Real-Time Reporting. So before we jump into things, some recommendations for getting the most out of this session. All chat conversation is taking place in the #coalesce-automating-ambiguity channel of dbt Slack.
As you can see, it’s disabled in Crowdcast, so please head over to Slack. That’s by design. If you’re not yet a part of the dbt community Slack, you have time to join now. Seriously, go do it: visit getdbt.com/community and search for coalesce-automating-ambiguity when you arrive. We encourage you to set up Slack and your browser side-by-side.
In Slack, I think you’ll have a great experience if you ask other attendees questions, make comments, share memes, or react in the channel at any point during Eric’s [00:02:00] session. And if you’re new to the dbt community, or this is your first time at Coalesce, please don’t be shy. It’s very important to hear from new voices.
So if you feel comfortable doing so, please chime in. To kick us off, our chat champion, Lauren Benezra, an analytics engineer at dbt Labs, has started a thread to have you introduce yourself, let us know where you’re calling in from, and share a random skill you picked up at some point that has proven far more valuable than anticipated.
After the session, Eric will be available in the Slack channel to answer questions. Let’s get started. Over to you, Eric.
[00:02:37] Eric Nelson: Hi, thank you for having me today. I appreciate you all attending my session. Now can you see my screen? I believe I’m sharing it. Yep. Okay, perfect. So thank you for joining my session on automating ambiguity, or managing dynamic source data using dbt macros.
And so in my previous role at Mattermost, the dev team was huge. It’s an open source product, and so the contributors also made a lot [00:03:00] of changes, and that in turn made a lot of changes to our data structures. And so managing that dynamic data was becoming really cumbersome for me. So that’s what this discussion today will be about.
And so let’s just jump into the agenda for today. I’ll give you a little bit of background, I’ll discuss the problem statements that were facing me as a data and analytics engineer at Mattermost and just broadly, and then I will discuss the tech stack that I worked with at Mattermost as well as my proposed solution that helped me reduce the overhead associated with managing all of the data pipelines and the downstream transformed, analytics-ready infrastructure.
And then I will do an overview of the macro, or the series of macros, and then a walkthrough of the individual code snippets, as well as a live coding exercise where we’ll execute the macro with the data model. Then we’ll review the log files, and I’ll show you the output and [00:04:00] discuss it conceptually and how it could really help reduce workloads on a lot of analytics engineers, so they can focus on other, more important things. And then I’ll talk about some of the limitations and potential feature additions if I have time to iterate and refactor the macros.
All right. So let’s just jump on. A little bit of background about myself. I was working at Mattermost as an analytics and engineering lead, and I was responsible for really all of the ELT processes around Mattermost’s business systems, as well as their website and product usage data. Their website and product usage data was captured via Rudderstack SDKs that we embedded. And so that is one of the things that we will discuss later, Rudderstack and how it produces data.
And then another role of mine was end-to-end analytics infrastructure design and maintenance. So really from the upstream data pipelines, all the raw data, to how we were syncing that raw data to Snowflake, how we were transforming our data with dbt, and then surfacing those transformed data models in our BI visualization tools.
And then another kind of side [00:05:00] project that I frequently worked on at Mattermost was A/B testing, so experimental design and monitoring. And in order to accurately track and report the metrics that we were trying to influence for A/B testing, a clean, reliable transformation of data was really essential.
[00:05:16] Problem: Data is messy #
[00:05:16] Eric Nelson: And so let’s get into the problems themselves. So the first problem is data’s messy. Whether you’re producing production data or release candidate data, it’s going into different schemas and different relations. It’s going everywhere in your data warehouse, and making sure you have all of it tracked and you’re surfacing it all relevantly is difficult. And then next, it’s never static. You’re constantly producing more schemas, relations, and properties within those relations. And tracking all of that is difficult to do without snapshotting, say, an information schema table to understand when these changes occurred and the nature of those changes.
And then data governance is just plain hard. At Mattermost, we were hiring developers on [00:06:00] what seemed like a daily basis. And so having them onboarded, ramped, and aware of the minute processes around how to make product changes and how those product changes affect the downstream raw data that’s surfaced in the data warehouse is incredibly difficult to do.
And this macro has really been able to simplify that, so I didn’t have to act as a whip and constantly enforce data policy as much. And then data volume and velocity increase as businesses scale. And so as this data increases, it’s harder and harder to query and analyze. It consumes more compute resources, and you really have to figure out a way to aggregate and surface the data so that it doesn’t utilize as much compute, and therefore costs less, because a lot of the databases these days are charging per resource or per uptime minute. So it’s really important. And then next, maintaining and transforming [00:07:00] raw dynamic data becomes too time consuming.
You spend all this time tracking down the changes, altering your model files with the new properties or new tables that need to be included in the aggregated transform table, then running alter table add column commands, and then backfilling data. It was becoming like 50% of my work.
And it wasn’t enjoyable work, so I figured, let’s fix that. And dbt was there to help. Alright. And so the problems I faced at Mattermost specifically were really: when data’s messy, how do we efficiently track new data sources and properties generated as product development cycles increase in velocity?
How do we analyze increasing types and volumes of customer engagement data? We’re producing new properties capturing different user attributes, as well as properties capturing different components of events that are firing on our product. And there are new fields being added to Salesforce regularly. So how do we [00:08:00] make sure we’re tracking all of that?
And then how do we handle variations in schemas, naming conventions, and data types? One small change to, say, a plugin name in the code base completely changes how it’s parsed and interpreted by the data warehouse when it’s being uploaded as raw data, creating duplicates. How do we make sure that we can account for all of this?
And then how do we process larger amounts of data using minimal compute resources? Like I discussed earlier, these are all kind of the questions that plagued me. And then lastly, again, maintaining and transforming raw dynamic data becomes too time consuming.
[00:08:36] Our tech stack #
[00:08:36] Eric Nelson: So let’s just go into the tech stack before we touch on the actual solution. Our tech stack really consisted of the business systems and the user platforms. You can see it’s pretty extensive. You have Marketo, Salesforce, Google Analytics, NetSuite and Stripe for billing, and then you have the actual web properties and the Mattermost products. And so that’s where the Rudderstack SDK came in.
And then you can see this middle ground here [00:09:00] where the Stitch, Amazon S3 and Rudderstack icons are in the middle, and those are the extraction layer. And then you load into our Snowflake data warehouse, into a specific database for raw data, which is then transformed using schedulers and tools like dbt, dbt Cloud and Kubernetes, and then that’s reinserted into our production-ready database. That is then surfaced to our BI and visualization tool, Looker, so that we have uniform data. And it’s all being processed before being surfaced in Looker, so it’s a lot more scalable from an analytics perspective.
And then let’s touch a little bit more on Rudderstack. It’s a customer data platform, similar to a Segment or a Bloomreach if you’re familiar, and its SDK provides functionality to enable multi-platform tracking of user engagement. So web, desktop, mobile, and other digital mediums: you can use the Rudderstack SDK to track usage across all of those if it’s instrumented correctly, and then you can generate and [00:10:00] store data. Or at least the SDK does: it generates data in default schemas, relations, and properties. And then you can add on to those schemas or relations or add custom properties. This customizability to capture customer engagement was really great for developers, and it provided them a lot of flexibility to add new things that we were trying to track and leverage to improve the product. But it also made it incredibly difficult and cumbersome as an analytics engineer to track how dynamic the data was becoming and all of these new properties that were constantly being produced.
[00:10:33] Solution: One dbt macro to rule them all #
[00:10:33] Eric Nelson: So the solution: one dbt macro to rule them all. Basically, what this macro is able to accomplish is that it addresses each of these pain points that I discussed earlier and provides a solution for them. It tracks changes to schemas and tables by cross-referencing the information schema data: it looks at the source tables that you’re targeting and then the target table, and it compares the columns [00:11:00] and the data types and ensures that all of them are present.
And then next, it automatically identifies changes to source tables and updates the master table. And so this is huge. It basically runs that alter table add column command for you in the background, so you don’t have to worry about that. And then it loads the data after executing that command, so there’s no backfilling either.
So it’s captured right when it occurs, and it loads the data as it occurs. And then again, it creates a nested visual relationship using a custom property. I call it the dbt source relation, which you might recognize if you’re familiar with the dbt utils union_relations macro.
Conveniently enough, it has kind of the same name as the one I’m using. It basically has a dbt source relation where you can track all the source tables that are being referenced in this master, or target, analytics table. And so you can see essentially where the data’s coming from, and where there could be some issues in data volume [00:12:00] or any unexpected changes.
And this nested visual relationship can also help you troubleshoot when developers think there may be a bug. And then next, it provides logic for incremental builds and full refreshes. And so this is huge. Like I was saying, it removes the need for those alter table commands, or alter table add column commands.
It adds the column in advance and does that loading of the data so that you don’t have to do a full refresh every time you want to add a new column or some structure changes. It will be able to do that for you and account for that stuff. And so it reduces the compute resources necessary.
And really how it accomplishes all of this is that it creates this column superset dictionary, where it has the column name and the data type. It then uses those to essentially identify overlapping columns between the source tables, plus any dummy column requirements that are cast as null values but with the table alias for their column, and so it can account for tables where the numbers of columns don’t match. That’s really [00:13:00] important when you’re doing a union statement: you need the exact same number of columns in each of the select statements being unioned.
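To make the column superset idea concrete, here is a minimal sketch of the kind of SQL this approach ends up generating for two source tables with mismatched columns. The table and column names are illustrative, not Mattermost’s actual schema, and dbt_source_relation stands in for the custom property mentioned above.

```sql
-- Illustrative compiled output only; table and column names are made up.
with mobile_track as (
    select
        cast(id as varchar)            as id,
        cast(received_at as timestamp) as received_at,
        cast(plugin_name as varchar)   as plugin_name,
        cast(null as varchar)          as page_path,   -- dummy column: only exists in the web app table
        'raw.mobile_events.track'      as dbt_source_relation
    from raw.mobile_events.track
),

web_app_pages as (
    select
        cast(id as varchar)            as id,
        cast(received_at as timestamp) as received_at,
        cast(null as varchar)          as plugin_name, -- dummy column: only exists in the mobile table
        cast(page_path as varchar)     as page_path,
        'raw.web_app_events.pages'     as dbt_source_relation
    from raw.web_app_events.pages
)

select * from mobile_track
union all
select * from web_app_pages
```

Because every select list is padded out to the same superset of columns, the union all works no matter how many columns each source relation actually has.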
[00:13:11] Union macro logic #
[00:13:11] Eric Nelson: This is going into the nitty gritty now of the union macro. This is, let’s call it, the 50,000-foot view, and we’ll go down further as this presentation progresses. So basically, it’s a series of macros. I call it a single macro, but it’s really several kept in a single file in the macros repository or directory. The macro essentially checks if it’s an incremental build. If it is an incremental build, then it compiles a list of all of the distinct, unique columns in the source tables. Then it looks at the target table and compares them. If the column counts differ, then it identifies which columns are not included in your master, your target table. Then it executes the next command down from there, which is the alter table master table add columns, and it [00:14:00] executes a for loop that goes through all of the missing columns and their data types and inserts them, and then executes that command. After that, it will execute the macro itself, or the unioning script portion of the macro. And you can see this is a visual aid into what it’s doing.
So at Mattermost, the way that we had set up our web app and desktop events, those were coming into a single relation in a schema. But for our mobile events, each event produced a different relation, and we had over 200 events. So think about that. That’s 200 tables, and each of those tables had probably, I would say, 50 properties each. A lot of them were overlapping, a lot of them were the default standard properties produced by Rudderstack. And so we needed to be able to report on all of our user engagement in a single table, not just by the various mediums or clients. So what this union macro essentially does is take all of those, say 100 or 200 mobile client event tables, plus our desktop and web app client event tables, and [00:15:00] then union them all together, creating dummy columns for certain tables where, say, a column doesn’t exist in one but exists in another, and casting them as null values. It then outputs a single master user events table, which is what we were able to transform further to surface, say, our monthly active user data, or our specific event-based queries when we’re looking at how many people are performing events on specific clients. You can even go further into the user agents of those clients, say if it’s the web app, and this just provided a single table to drill down into.
[00:15:39] What’s the big deal? #
[00:15:39] Eric Nelson: And again, I’m just going to reiterate: what’s the big deal here? It eliminates the need to write a lengthy SQL script and then to maintain that lengthy SQL script. The user events table at Mattermost ended up being, I want to say, close to a thousand lines of code, and to troubleshoot that and to make changes to it would almost be a full-time job in itself.
And so creating this macro and having it [00:16:00] basically do all of the operations to maintain that script was huge for me. And again, like I said, the model itself is using the macro. The model file has maybe four lines of actual code referencing those macros, while the actual script underlying that is huge.
So it really takes away a lot of that overhead of having to use your eyes to troubleshoot things. And again, no more DDL. The model maps dependencies: you put a depends-on statement in the file, it knows what it’s looking at and knows what it’s depending on, and thus, if anything breaks downstream, you can troubleshoot accordingly.
And it won’t try to load data if something does fail upstream. So basically it automatically detects those new columns. Reiterating this: no more alter table add column commands, the bane of my existence. And backfilling, it’s the worst. If you catch something 30 days down the road and you have to backfill 30 days of data, and say it’s your event data, that can take several hours, especially if you’re not willing to use, say, a larger warehouse that provides more compute resources.
[00:17:00] So reducing all of that time spent, I think it gave me back, I don’t want to say, 20 hours a week. And so now let’s get into the nitty-gritty, kind of the 10,000-foot view of the macro itself and the model file leveraging the macro. Keep in mind, everything I’m showing you is actually staged data, because now that I no longer work at Mattermost I can’t leverage the existing infrastructure. So I set this all up in advance: I created this kind of dummy user event table to show you how it references the macro itself.
And so there are two separate macros: you set a rudder relations variable using the get source target relations macro, and then there’s a union relations macro. There are different arguments in each. But essentially what you need is a list of the schemas that you’re targeting. So say for mobile events, we had one schema. For desktop events, we had another schema, and then for web app events, we had another one. And so you would include those in the schemas argument, and that would allow you to index all of those schemas and look at the tables within them. And [00:18:00] for the database, it’ll default to the target database in your profiles YAML file.
But here I just specified it, and then there are table exclusions and table inclusions arguments. If you use table exclusions, you’re basically pulling in all tables except the ones you’re specifying in that exclusion list. And that’s really important for, say, the mobile events, where we have 200 event relations in that schema and then maybe six tables that we didn’t need.
So we would use table exclusions there, but say you’re targeting a schema where you only want a subset of tables, then you can use table inclusions, and that will allow you to specify just those key target tables and omit everything else. And then union relations will take the output of that, the rudder relations variable, and parse it out. There’s a much cleaner way of doing this, but this is just how it stands; it’s due to be refactored. It basically parses out the list that’s returned, so [00:19:00] rudder relations zero is essentially the list of your source tables, which allows it to parse through those tables and get all of the column requirements. And then the target relation kind of speaks for itself: it is the target relation, which is the user event telemetry table, the model file you’re looking at here.
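As a rough sketch of what such a model file could look like (the macro names, argument names, schemas, and tables here are approximations of what’s shown on the slide, not the actual Mattermost code):

```sql
-- models/user_event_telemetry.sql (illustrative sketch only)

{{ config(
    materialized = 'incremental'
) }}

-- depends_on: {{ ref('base_user_events') }}  -- hypothetical upstream node

{%- set rudder_relations = get_source_target_relations(
        schemas          = ['mobile_events', 'desktop_events', 'web_app_events'],
        database         = 'raw',
        table_exclusions = ['rudder_discards'],
        table_inclusions = []
) -%}

{{ union_relations(
    source_relations = rudder_relations[0],
    target_relation  = rudder_relations[1]
) }}
```

Here rudder_relations[0] would hold the list of source relations and rudder_relations[1] the target relation, matching the parsing described above.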
Then you can see the depends-on and the config stuff, which allow you to specify whether or not it defaults to an incremental build, and the model files that it’s depending on, the nodes. And so basically what happens, it’s kind of hard to see this code, but it essentially calls the information schema to retrieve the source tables.
And then it looks at those source tables, creates relation objects, and then returns them. And that’s what’s referenced in the union relations macro. I apologize for the size of this. Let me stop my camera and give you a little bit more to see. Still really hard to see; I wish it was a little bigger.
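A plausible shape for that relation-fetching macro, using standard dbt Jinja (run_query and api.Relation.create); the exact macro name, arguments, and filtering logic are assumptions based on the slide, not the original code:

```sql
{% macro get_source_target_relations(schemas, database, table_exclusions=[], table_inclusions=[]) %}

    {#- Pull every table in the targeted schemas from the information schema. -#}
    {% set get_tables_sql %}
        select table_schema, table_name
        from {{ database }}.information_schema.tables
        where lower(table_schema) in (
            {%- for schema in schemas -%}
                '{{ schema | lower }}'{{ ", " if not loop.last }}
            {%- endfor -%}
        )
    {% endset %}

    {% set source_relations = [] %}

    {% if execute %}
        {% for row in run_query(get_tables_sql).rows %}
            {%- set schema_name = row[0] -%}
            {%- set table_name = row[1] -%}
            {%- if table_name | lower not in table_exclusions
                  and (table_inclusions | length == 0 or table_name | lower in table_inclusions) %}
                {% do source_relations.append(
                    api.Relation.create(database=database, schema=schema_name, identifier=table_name)
                ) %}
            {%- endif %}
        {% endfor %}
    {% endif %}

    {#- Return the source relations plus the target relation (the model being built). -#}
    {% do return([source_relations, this]) %}

{% endmacro %}
```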
But what it’s doing is it instantiates the [00:20:00] dictionary and list variables and iterates through the target relation list. It checks if it’s a valid relation, and if so, then it retrieves the columns. And then after that, it loops through the source and target columns and identifies the missing values.
And so this is if it’s an incremental build and all those values have already been passed into the union relations macro. I’m sorry if you’re losing your hair, I know it’s hard to follow along, but basically this is the portion of the macro that’s checking for and then adding new columns if they occur.
And so what it would do if it identifies missing columns is it creates a dictionary containing those missing column names and data types. And then it basically uses that to loop through and execute the alter table script. And if not, it basically executes an empty script and redirects you back to the union script.
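That column-diffing and alter-table step might look roughly like the following; the macro name, its arguments, and the shape of the superset dictionary are assumptions, and the multi-column add shown uses Snowflake-style syntax:

```sql
{% macro add_new_columns(target_relation, column_superset) %}

    {#- column_superset is assumed to be a dict of {column_name: data_type}
        built from all of the source relations. -#}
    {% set target_columns = adapter.get_columns_in_relation(target_relation) %}
    {% set existing_names = target_columns | map(attribute='name') | map('lower') | list %}

    {% set missing_columns = {} %}
    {% for name, data_type in column_superset.items() %}
        {% if name | lower not in existing_names %}
            {% do missing_columns.update({name: data_type}) %}
        {% endif %}
    {% endfor %}

    {% if missing_columns | length > 0 %}
        {% set alter_sql %}
            alter table {{ target_relation }} add column
            {%- for name, data_type in missing_columns.items() %}
                {{ name }} {{ data_type }}{{ "," if not loop.last }}
            {%- endfor %}
        {% endset %}
        {% do run_query(alter_sql) %}
        {% do log('Added ' ~ (missing_columns | length) ~ ' column(s) to ' ~ target_relation, info=true) %}
    {% endif %}

{% endmacro %}
```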
And the union script is a much larger script also. Too small to really see, and I apologize again. In addition to instantiating variables, it iterates through those target relation lists and checks if they’re [00:21:00] valid. It loops through the source columns and creates that superset dictionary with all the unique column names and data types contained across the source tables.
And then it checks if it’s being run incrementally. And if so, then it executes that add new columns macro, which is the one we were just referencing. And if it’s not, then it just executes the entirety of the unioning script and does a full refresh. And so that union script portion, it’s a very customizable query section.
And in the future additions, I want to basically be able to add a separate argument where you can specify a specific script type. So say if you want to aggregate to a daily, weekly, or monthly average, you’d be able to do that. But here it leverages CTEs: it iterates through the table and column lists, essentially creates all of the CTEs necessary, and then references each of those CTEs in a union loop, which you can see down here at the bottom of the file.
So that is what you’re seeing [00:22:00] right here. And so you can see it loops through all of them and casts them as the appropriate data type and column. And then if it’s incremental, it only moves data with a received-at greater than the max received-at time since the last upload. And so that’s really the value there.
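Put together, the unioning portion that generates those CTEs and the incremental filter could look something like this sketch; the helper name, the received_at column, and the CTE naming are assumptions, and the real macro is considerably more involved:

```sql
{% macro generate_union_sql(relations, column_superset) %}

{#- One CTE per source relation: every column in the superset is cast to a common
    type (or null when that relation lacks it), then everything is unioned.
    Incremental runs only pull rows newer than the latest received_at in the target. -#}
with
{%- for relation in relations %}
    {%- set relation_columns = adapter.get_columns_in_relation(relation)
            | map(attribute='name') | map('lower') | list %}
{{ relation.identifier }} as (
    select
        {%- for name, data_type in column_superset.items() %}
        {%- if name | lower in relation_columns %}
        cast({{ name }} as {{ data_type }}) as {{ name }},
        {%- else %}
        cast(null as {{ data_type }}) as {{ name }},  -- dummy column for this source
        {%- endif %}
        {%- endfor %}
        '{{ relation }}' as dbt_source_relation
    from {{ relation }}
    {%- if is_incremental() %}
    where received_at > (select max(received_at) from {{ this }})
    {%- endif %}
){{ "," if not loop.last }}
{%- endfor %}

{% for relation in relations %}
select * from {{ relation.identifier }}
{{ "union all" if not loop.last }}
{% endfor %}

{% endmacro %}
```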
[00:22:15] Current limitations and future additions #
[00:22:15] Eric Nelson: And again, current limitations and future additions. So let’s go into that a little bit. I’ve discussed it. Oh, I totally forgot the live coding portion. So let me show you guys the actual execution of the macro itself. What you can see here is I’ve already typed it in: dbt run -m user_event_telemetry.
And so basically what this is doing is running against dummy data I’ve already created that we can reference. And you’ll see it’ll populate this log file, where the text is incredibly small. Let’s see if we can blow that up a little bit. No, not really. So I will show you: essentially, what you’re seeing in this log file is the alter table command [00:23:00] here.
It’s gross, I’m sorry it’s so small, but that is adding a new column that I added to one of the source tables prior to this. And then after it adds that column, it creates the temporary table to load into. And what you can see is it’s incremental right now, so it has the max timestamp CTE, and then it creates the desktop events CTE, as well as the mobile events and the web app events CTEs.
And you can see all of the columns that it’s creating and casting properly. As you can see, it’s a very large script, and this is a very simple version of it. And then you see the union right here, where it unions those three base CTEs together to create the table to load into the existing user event telemetry table.
Alright, so let’s get back to the current limitations and future additions. Right now, the macros only support Snowflake, Postgres and Redshift, or at least databases that have an information schema. And so expanding [00:24:00] that: I’m not sure if databases like BigQuery, which I haven’t tested yet, would necessarily work with this.
And it all depends on what you have in your profiles YAML file. And then also adding a method to incorporate, like I talked about earlier, additional transformation logic for the union script. So there would be an additional argument, say a script type, that would allow you to specify what script you want to execute.
It would essentially still do the unioning, but it would bring it up to a different level of aggregation and help simplify your job further. And then a method to unify column names and column naming conventions. That would be huge for me, because Mattermost was constantly changing plugin names, constantly developing new products, and creating new properties based on how Rudderstack and other tools parsed special characters.
And so being able to unify all of that, coalesce it, and then cast it as a single alias would be hugely beneficial and help condense the table size, and ultimately the compute resources used to query those tables. And yeah, so that is my presentation today. I really [00:25:00] appreciate everyone’s time, and I hope it was informative enough and I didn’t lose you too much in the weeds there. All right. Time for Q&A.
Last modified on: Oct 21, 2024