Optimizing query run time with materialization schedules
Oct 29, 2021
How my team decreased query time in Looker and saved Snowflake credits by materializing views and tables and implementing tags for job schedules. I will also discuss how we intend to use the dbt_artifacts package to monitor those job and test runs.
Full transcript #
Julia Schottenstein: [00:00:00] Welcome everyone to your next session at Coalesce: Optimizing Query Runtime with Materialization Schedules. I’m Julia Schottenstein, I’m part of the product team at dbt Labs, and I’ll be your host for this talk. You’re about to hear from Ola Canty, who’s a business analyst at Firefly Health, and she will spend the next half hour walking through some practical learnings on how to materialize your models for the best query performance and, importantly, when to schedule jobs to meet your business stakeholders’ needs. This is a great session for those with lots of data looking to improve the usability and performance of their data stack with dbt. Please join the conversation in the Coalesce materialization schedules Slack channel, and Ola will answer any questions after the session.
This is also a great time to share any performance hacks you’ve discovered with the rest of the community in the chat. I’m super excited for this talk. So without further ado, over to you, Ola.[00:01:00]
Ola Canty: Hello, I am Ola Canty. And today I will share with you an interesting project I worked on at Firefly Health. As background, Firefly is a healthcare company that focuses on virtual primary care and health care coverage. As a primary care provider, it is necessary to maintain quick and easy access to patient information and operational data.
With this in mind, I will share how we went about optimizing query run times to increase performance.
I will begin with an explanation of how our data stack transformed from the pre-dbt days. Then we will take a look at the red flags that alerted us to a problem with our query runs. Next, we will dive deep into our approach to remedying these issues, particularly using tags for schedules. Finally, we will round out with the results of our [00:02:00] efforts and the things we learned along the way.
Now let’s get started.
As our startup grows in both customers and employees, the amount of data that we must store and analyze grows even faster. Before introducing dbt, our analytics relied on a trio of cloud analytics products: Fivetran for extraction and loading, Snowflake for storage, and Looker for transformation and visualizations.
By adding the layer of dbt to our stack, we are better able to manage and organize the expanding data transformations needed. As the business grew, much of the configuration for dbt resembled the out-of-the-box setup, eventually leading to long dbt job runs and very long load times. This indicated that we needed to take steps to optimize our data [00:03:00] management and/or queries to reduce these inefficiencies.
[00:03:04] The latest data stack #
Ola Canty: With the introduction of dbt to the data stack, we also had an opportunity to reimagine our approach to managing data over the course of a few months. We started to instill order into our data so that the growing amount of information in the system would not be so taxing to interact with. This involved making refinements along the way for more optimal data management, including transitioning most of the data transformations performed in Looker into dbt models and reorganizing our dbt models into staging versus marts, which is a dbt best practice.
Okay. Have you ever dealt with this situation? You’re sharing your screen during a Zoom meeting and you click on a chart to drill in deeper, but you get the spinning wheel. [00:04:00] The wheel keeps turning, and one minute feels like forever as everyone stares at your screen, waiting for those amazing insights you told them would be uncovered by the drill-down.
We knew that there was truly a problem when our stakeholders would load multiple Looker tabs before a Zoom meeting to avoid getting stuck on pages that took far too long to load. We found that this was due to very long execution times for each of the looks and dove deeper into the root cause.
[00:04:35] Climbing credit utilization #
Ola Canty: One of the first signs that the slowness in Looker was due to our query methods was that our Snowflake credit utilization was steadily growing. We had been assuming that much of this was due to our growing analytics team running more queries and our expansion of curated models in dbt. We were prepared to resign ourselves to costs rising even more quickly [00:05:00] because we intended to grow the team even more quickly.
And we were all in for curating data models, but did this really have to be our fate?
By diving into the Looker metadata, it was obvious that some models were referenced very often. And unfortunately, those models had a series of complex calculations to perform every time they loaded, because they were being materialized as views in Snowflake. This is an explain plan from Snowflake of our most commonly referenced model, users.
Since it contains basic details of our patients, as a view it would have to be recalculated each time outside of a cached session in Looker.
[00:05:59] Steps toward the solution #
Ola Canty: Now I am [00:06:00] going to walk us through the steps we took to resolve our query issues. First, we will take a look at the fairly simple process of materializing tables. Next, now that we no longer used views, we needed to determine schedules for the tables to be recalculated. Last, there had to be an easy way to roll out multiple schedules for maximum flexibility.
Our models were all being materialized as views, a remnant of the default setup in dbt and the need-to-know-now mentality of the startup. However, our models were beginning to use more complex calculations and were more tightly entwined than before, meaning that even the simplest looks in Looker required extensive calculation at each load.
So first we [00:07:00] changed the default for our models to materialize as tables, meaning calculations were only performed at the scheduled times instead of ad hoc when visuals were rendered. This was very simple to set up in the dbt_project.yml file. The difficult part would be replicating the near real-time data.
And as you can see in the bottom corner, we have set up the models under our project name to be materialized as tables. You just have to specifically and explicitly call for it.
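For reference, here is a minimal sketch of that configuration in dbt_project.yml; the project name firefly_health is illustrative, not the actual project name.

```yaml
# dbt_project.yml (illustrative project name)
name: firefly_health
version: "1.0.0"
config-version: 2

models:
  firefly_health:
    # Override dbt's default view materialization so every model in the
    # project builds as a table unless configured otherwise.
    +materialized: table
```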
So I mentioned the need-to-know-now mentality earlier, but that of course means we needed to solve any issues that would arise from this. There were two aspects to this mentality. First, our stakeholders were accustomed to near real-time data, which is honestly a five-minute delay [00:08:00] due to the Fivetran schedules,
and it would be difficult to change this mindset. Secondly, as a healthcare provider, there is data that must absolutely be refreshed frequently to avoid missing patient concerns.
The next step was to identify true data urgency and bucket the models into various schedules. Along with stakeholders and fellow analysts, I gathered information about the data schedules and designed a worksheet that laid out all models and their possible schedules. The default schedule, i.e., if no one mentioned a need for a particular frequency, was set to once a day, while the special schedules were 30 minutes, four hours, and monthly.
In this graphic, everything has [00:09:00] been organized according to its urgency. Blue items need to be fairly up to date and would be synced every 30 minutes; yellow items, every four hours; gray, monthly; and red must be as current as possible and would continue to be materialized as views. These schedules make sense when you consider their content. Appointments are constantly being scheduled, and we should be able to see new appointments or cancellations very quickly in order to gather needed information from patients.
But any information we need instantly about appointments would be reflected in the medical record system, and this data was primarily for operational purposes, so 30 minutes is a sufficient sync [00:10:00] schedule. On the other hand, we only receive claim files once a month, and it makes no sense to sync more frequently.
Some models absolutely need to be near real time, like patient chat messages, which we aim to reply to within 10 minutes, and form responses that alert us to potentially dangerous health concerns. We set aside this bucket of models to be materialized as views. After assigning all of the categories, I also ensured that any upstream models were refreshed at least as frequently as the downstream models that relied on them.
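As a hedged sketch of that last bucket, a directory of near real-time models can keep the view materialization while the project default stays table; the directory names below are hypothetical rather than Firefly’s actual layout.

```yaml
# dbt_project.yml (directory names are hypothetical)
models:
  firefly_health:
    +materialized: table        # project-wide default from the earlier step
    marts:
      communication:
        chat_messages:
          # Models that must stay near real time remain views, so Looker
          # always queries the freshest loaded data.
          +materialized: view
```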
Once the appropriate refresh schedules were identified, I then needed to find a way to implement those schedules. The solution needed to be easy to implement, easy to maintain and easy to proliferate.[00:11:00]
[00:11:03] The solution: Tags #
Ola Canty: Tags were applied in the dbt_project.yml file to every model that followed a prescribed schedule. Note, I am sharing screenshots of what this looks like in the file because we definitely made mistakes along the way. Here, you can see the layout that we followed. First, we needed to progress from general rules to more specific rules.
I know this is obvious, but the materialization as a table was set as the most general rule in the previous step, and by not defining a tag, the inherited refresh schedule would be daily. These tags count as the specific rules. Also, indentation is incredibly important, and you will not get an error or warning if you get it wrong.
The indentation needs to follow the nesting pattern of your [00:12:00] directories. These are screenshots of our directories along the left, with models leading to marts, which includes the communication directory containing files related to chat messages, phone calls, and such. This pattern is reflected in the dbt_project.yml file on the right.
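For example, here is a sketch of how that nesting might look, assuming hypothetical directory and tag names rather than Firefly’s exact ones; the indentation mirrors the models/marts/communication folder structure.

```yaml
# dbt_project.yml (directory and tag names are hypothetical)
models:
  firefly_health:
    +materialized: table
    marts:
      communication:
        # Tagging the directory applies the tag to every model inside it.
        +tags: ["sync_30_min"]
      claims:
        +tags: ["monthly"]
      operations:
        appointment_summary:
          # An individual model can also be tagged by nesting its file name.
          +tags: ["sync_4_hours"]
```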
In fact, if you want all of the files within a directory to follow a specific schedule, you can tag that directory the same way you would tag a single file. So that’s a nice, easy way to get around tagging every individual file within a directory. Now for another helpful hint: I tested whether I had properly implemented the tags by running dbt run against the specific tags.
So you just specifically call dbt run for the tag, keeping a close eye [00:13:00] on the progress of the run. I could tell whether an intended model was left out or if too many things were running.
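In practice, that check can be run from the command line; a sketch, using the hypothetical tag names from the earlier example.

```shell
# Preview which models a tag selects without running anything
dbt ls --select tag:sync_30_min

# Run only the tagged models and watch the log for anything missing or extra
dbt run --select tag:sync_30_min
```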
Now to the fun part: implementing those tags. Jobs are set up according to each schedule to specifically run the models based on their tags. After some trial and error, we chose to only run dbt tests during the daily job to cut down on the time spent on each job, which is especially a concern for the 30-minute job, which has limited time between runs.
The daily job explicitly excludes the schedule tags from the run but tests all models, regardless of whether they have schedule tags. This run also automatically syncs the code for models materialized as views. On the left, we have the 30-minute run. To trim down the time [00:14:00] of the run, in our case nine to 14 minutes, generate docs is toggled off, and instead we specifically call the command to generate docs without compiling. On the right is our daily job.
Generate docs is toggled on to capture comprehensive documentation and all of the dbt cleanliness steps that are included. We also have a few other fun operations running. The one you see at the bottom drops the tables and views that no longer have existing models in dbt.
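As a rough sketch, the command steps behind those two jobs might look like the following; the tag names and the drop_orphaned_relations cleanup macro are hypothetical stand-ins for the project’s actual operations.

```shell
# 30-minute job: run only the 30-minute models, skip tests, and generate
# docs without recompiling to keep the run short
dbt run --select tag:sync_30_min
dbt docs generate --no-compile

# Daily job: run everything not on a special schedule (which also recreates
# the view materializations), then test every model
dbt run --exclude tag:sync_30_min tag:sync_4_hours tag:monthly
dbt test
dbt docs generate
# Hypothetical macro that drops tables and views with no matching dbt model
dbt run-operation drop_orphaned_relations
```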
So when making major changes like this, communication with stakeholders is key. During the first week of rollout, we encouraged people to let us know if they needed some things to be run more frequently, or if something seemed off with the data, for instance, [00:15:00] if there were upstream dependencies we hadn’t considered. For example, we came across situations where some models that fed into the worklist workflow were not primarily or explicitly used for worklists,
and those had unintentionally been left out of the 30-minute refresh, which caused some wonkiness in the resulting models. Because we use tags to manage everything, it was very easy to make changes that only required updates to the YAML file.
[00:15:39] Win: Decreased average query time #
Ola Canty: Huge win. These changes, along with some additional adjustments, were implemented in mid-April and made a huge difference in Looker query time and Snowflake credit usage. Before we implemented the materialization changes, the average query execution time was about 0.54 seconds per query, and you can [00:16:00] imagine hundreds of queries running, in March before the tags were added.
It immediately decreased to 0.39 seconds the month that the changes were made, and went even lower in May, the first full month of tags.
You can also imagine that a risk of running queries on a schedule is that you may end up running more queries than before and using more credits as a result. However, we saw somewhat different results. Of course, the number of queries has risen steadily over the months. However, the number of credits per query (broken out in this visualization as credits per 10,000 queries) has decreased.
This suggests that by intentionally running the most [00:17:00] important yet burdensome models on a set schedule, we ultimately decreased the amount of time spent running queries. This provides us a lot more flexibility in terms of data growth, since we can get far more queries for the same or fewer credits.
[00:17:25] Things learned #
Ola Canty: Okay. Lastly, for the things we learned after implementing the materialization schedules: we came across the dbt_artifacts package. This package basically takes the documentation generated during job runs and tests and extracts information about runtimes, which models were run, and lineage for further analysis, such as being able to visualize it in Looker.
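For reference, the package is installed like any other dbt package: add an entry to packages.yml and run dbt deps. The coordinates and version range below are illustrative (the package has changed maintainers over time), so check the dbt Hub listing for the current values.

```yaml
# packages.yml (coordinates and version range are illustrative; check dbt Hub)
packages:
  - package: brooklyn-data/dbt_artifacts
    version: [">=2.0.0", "<3.0.0"]
```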
I wish I had set up the dbt_artifacts package before making all the [00:18:00] changes to materializations so that we could have used that data to track the impact of our changes. I especially would have liked to track the overall amount of time spent in job runs throughout the day, before versus after the materialization changes. It would have been optimal to set up a pre/post test: first implementing dbt_artifacts on the daily job to gather baseline data about the model runs, and also information about the basic timing of the dbt_artifacts package itself, since the package adds a bit of time to the run. Then it would have been easier to measure the impact of the materialization schedules.
It also would have been easier to identify specific models that could use some code optimization for efficiency. But alas, we will just have to use dbt_artifacts going forward for its ingenious extraction of metadata for basic [00:19:00] analytics operations. So I went a little quicker than expected; however, that is all.
Thank you very much for taking the time to join the session. It has been an honor and pleasure to share a solution for something that plagues many data engineers. I will now open up the floor to questions.