
dbt, Notebooks and the modern data experience

Oct 29, 2021


dbt brings best engineering practices into the world of data & analytics to help you build out a solid foundation.

Data science notebooks help you leverage well-modelled data and create chaos in exploratory workflows. To make any data endeavor successful, you need both: tools that give you a strong foundation, coupled with tools that help you experiment and break things.

In this talk, we will introduce Deepnote, a collaborative data science notebook, and show how you can use it with dbt to bring more collaboration into your team's data modelling and exploratory workflows.

Follow along in the slides here.

Full transcript #

Amada Echeverría: [00:00:00] Welcome everyone, and thank you for joining us at Coalesce 2021. My name is Amada, and I'm a developer relations advocate on the community team at dbt Labs. I'm thrilled to be hosting today's session, "dbt, Notebooks and the modern data experience," presented by Elizabeth Dlha and Allan Campopiano.

Elizabeth is a product manager at Deepnote, where she is building a collaborative notebook for data teams. Prior to Deepnote, she worked as a consultant and product manager at McKinsey, developing advanced analytics solutions for the firm's clients. Aside from building products, Elizabeth is excited about building high-gravity communities and exploring new hiking paths.

Allan is a data scientist, also at Deepnote. He has developed peer-reviewed statistical software libraries, such as Hypothesize for Python and STATSLAB for MATLAB, and has a background in cognitive neuroscience. Allan is an avid coffee drinker and [00:01:00] loves to cook, play music, and write software. Today's notebooks have improved the modern data experience by breaking down pre-existing silos across the entire team.

In this 30-minute session, Elizabeth and Allan will paint a picture of a world where analytics engineering and data exploration pair naturally in the notebook. They will accomplish this by introducing Deepnote, a collaborative data science notebook with a dbt integration. Before we jump into things, some recommendations for making the best out of this session.

All chat conversation is taking place in the #coalesce-dbt-deepnote channel of dbt Slack. If you're not yet a part of the dbt Slack community, you have time to join now. Seriously, go do it. Visit getdbt.com/community and search for coalesce-dbt-deepnote. When you arrive, we encourage you to set up Slack in your browser.

Side-by-side, it makes for a great experience if you ask [00:02:00] other attendees questions, make comments, share memes, or react in the channel at any point during Elizabeth and Allan's session. And if you're new to the dbt community, or if this is your first time at Coalesce, please don't be shy. It's very important to hear from new voices.

So feel free to chime in if you'd like. To kick us off, our chat champs and my colleagues from dbt Labs, Sanjana Sen and Jeremy Cohen (jerco on Slack), have started a thread to have you introduce yourselves and share your favorite meme about data science, stats, or machine learning. After the session, Elizabeth and Allan will be available in Slack to answer questions.

Let’s get started over to you, Elizabeth and Allan.

Elizabeth Dlha: Hi everyone. Thank you so much for the intro, Amada, and thank you to the dbt team for having us; this is our first Coalesce, so we're both very excited. So let's jump right into it. Perfect. So today we're talking about dbt, computational notebooks, their [00:03:00] interplay, and their role in the modern data stack.

A little bit of background: I'm Elizabeth, and I'm joined today by Allan, who is my colleague and a data scientist at Deepnote. As you can see, we're both from Deepnote, where we're building a collaborative data science notebook. Teams and companies come to us with a need for a powerful prototyping and exploratory tool that allows them to work with each other seamlessly, that they can use to build models and analyses, and to easily distribute those to the folks who consume datasets.

So naturally, as a part of this work, we are not only thinking about notebooks in isolation, but we're also reimagining notebooks from first principles. We are thinking about what the role of the notebooks framework should be within the broader modern data stack. And that is what we are going to be talking with you about today.

So there are three key parts to this talk. First, I'm going to tell you a little bit more about notebooks as we know them today, their challenges, and the key forces at [00:04:00] play that are actually shifting us away from legacy implementations. Second, Allan will walk you through how we're thinking about the principles of the modern data science notebook, and share a little more about our implementation at Deepnote.

And lastly, the reason why we're here today: we're going to imagine a world where we ignore the fragmentation that we're seeing in the modern data stack today, and instead focus on the experience. Benn Stancil wrote this amazing piece called "The Modern Data Experience." If you haven't read it, I highly recommend you do so.

And in his blog post, he argues for moving away from the boxes and the lines of the modern data stack and thinking more about the experience of the end user. So in this third part of our talk, we will allude to this concept of the modern data experience. And we will play a little magic trick, get curious, and explore what it would look like if we married the chaotic world of data science and notebooks with the engineering rigor of dbt.

[00:05:00] Notebooks of today #


Elizabeth Dlha: So, that all being said, let's actually start with the modern data stack and its boxes and arrows. We all know this image in some shape or form. This is a representation of the modern data stack. On the left-hand side, we have the data sources that go through the ingestion layer into the data storage layer, where the data transformation and modelling happen, all the way into the multiple tools on the very right.

So traditionally, data science notebooks sit on the right, next to BI and other analytics tools. But why are notebooks actually important? The first notebook-type interface dates back to the eighties with the release of Mathematica by Wolfram. Before Mathematica, technical computing was really not very accessible to many. Suddenly a scientist or a mathematician could run computations themselves. [00:06:00]

They could save and publish them in the notebook format. And this really opened up the floodgates: it made computation immediate and allowed us to translate high-level thinking into actual computation, to read, write, and compute in order to represent and understand computational ideas. So to summarize, Mathematica really acted as this enabler for research and education, for fields like quantitative finance, and ultimately for data science.

This opened the doors to the notebook implementations that we know today, like Jupyter notebooks. So we know that notebooks are powerful and they solve an incredibly important problem. Historically, this is a snapshot of the Mathematica platform. The paradigm really combines inputs and outputs.

It gives context to your analysis and to your code, and it enables you to execute and iterate rapidly. So it's great for rapid data exploration, it's great for [00:07:00] visualization, for literate programming, and for reporting. And if you're here today, chances are that you or someone on your team actually uses notebooks as a landing zone for all of your exploratory work.

This is also reflected in this graph that I'm going to share with you now: notebooks are increasingly popular among data scientists and analysts. The growing data maturity of companies is reflected in a broadening user base. This graph shows us a hundredfold increase in the number of Jupyter notebooks on GitHub over the past seven years.

So this is where we are. But as great as notebooks are, we could actually argue that the traditional implementation of notebooks has really started to feel dated. As much as I love the notebooks framework, some of the existing implementations have pretty fundamental challenges, and there are actually whole white papers and conference talks [00:08:00] on the challenges of notebooks.

I really like the one by Joel Grus. If you haven't seen it, please do; I give it to all of our new joiners at Deepnote. It's a great source of inspiration for how we're designing the platform. If this were an in-person conference, I would probably have you shout out all the things that you dislike about notebooks.

So feel free to drop them into the Slack chat. In the meantime, I'm going to quickly go over some of these myself, to give you an overview of what we are actually solving. First, notebooks have limited accessibility. This might not be the most relevant one for you personally, but there's a huge wave of newcomers into the data science world, and to them, notebooks

hold a huge amount of potential. Notebooks can accelerate their workflows and unlock their productivity and, ultimately, their work. But we often hear that notebooks have a high barrier to entry. There are a bunch of things: there are issues getting set up, [00:09:00] reproducing the work of others, installing dependencies, and getting the right data sources.

That's on one side. On the other side, notebooks are simply not intuitive for many: you have to deal with things like hidden state and out-of-order execution. Second, versioning. With notebooks, it's sometimes nearly impossible to find the source of truth, especially if you work interactively. If you work with others, you need to keep track of the different .ipynb files flying around, and versioning notebooks with git can get so confusing that some people rather don't do it at all, which is always a downgrade from proper engineering practices.
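To make the versioning pain concrete: an .ipynb file is just JSON, with outputs and execution counters stored alongside the code, which is why a one-line code change can produce a sprawling git diff. A minimal sketch that peeks inside a notebook file (the file name is a placeholder):

```python
import json

# Peek inside a notebook file: each cell carries its type, execution
# counter, and any captured outputs, all of which churn in git diffs.
with open("analysis.ipynb") as f:  # placeholder file name
    nb = json.load(f)

for cell in nb["cells"]:
    print(
        cell["cell_type"],
        "| execution_count:", cell.get("execution_count"),
        "| outputs:", len(cell.get("outputs", [])),
    )
```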

Next, because notebooks are really hard to share, they actually encourage siloing and fragmentation of the data science process.

We believe that data science is a team discipline, but ironically, data science notebooks generally don't support collaboration in exploration and prototyping. It's something that is really limited to the data scientist [00:10:00] and the interface they're using, which then results in end users of the data assets only coming in at the end of the process, leading to delivery risks, timeline slips, and cost implications.

Next, leaking data and security concerns. This is probably an obvious one. If you want to run a notebook that you have locally, one that someone else sent to you, you have to download a massive CSV file with sensitive data to your local device, which really increases the attack surface for a potential data leak.

And lastly, this is one of the most common criticisms: notebooks are often seen as a playground, but not as a great tool to write clean, production-ready code. So these are challenges we're all familiar with. But in addition, these issues that we know from the legacy notebooks interface are further amplified by larger things that are going [00:11:00] on around us.

And I would like to highlight just three forces which, in my perspective, change how notebooks can be implemented and used in organizations. The first is the focus on data productization. The role of data scientists and data teams is really changing, and it's not simply to derive insights anymore.

It's rather to build lasting data products that end users can interact with. Second, the citizen data scientist is on the rise. There is an increased curiosity from non-data scientists in organizations who want to be able to self-serve analytics. They want to search for insights and use them in their work without handholding.

And lastly, our data science tools need better collaborative interfaces. This is in part caused by this need for democratization of access to data and insights, but also COVID has changed [00:12:00] much of our work into a remote or hybrid setup. That means that we need tools that truly foster collaboration, both within data teams and cross-functionally.

So to summarize: both the legacy challenges of the notebook and some of these global forces that I just described, which we're seeing in the market today, make us think hard about the notebooks paradigm. And in the spirit of optimizing for the experience, we would love to share with you how we're thinking about addressing these challenges, how we're thinking about the notebooks of tomorrow,

so that we not only have a notebook, but also reimagine it into a tool that can foster data collaboration and accessibility. So now I will hand over to Allan, who will tell you a bit more about what that looks like in practice.

[00:12:52] Notebooks of tomorrow #

Allan Campopiano: All right. Great. Thank you so much, Liz. Yeah, so let's talk now about the notebooks of tomorrow. Liz mentioned many forces [00:13:00] that have had an effect on data science notebooks. In my previous role as a research analyst at a school board, I worked on a very small data team, and I have personally felt the pain of traditional notebooks.

So let me spend the next few minutes describing my own experience and the potential of the modern data science notebook. Initially, when I began using notebooks for research and data science projects, automation, reporting, things like that, I was very much experiencing the pain depicted on the left-hand side here.

That is, it was very difficult to share and collaborate effectively with notebooks across the team. They siloed us. I remember one project in particular where I used Google Colab, and we experienced all kinds of problems with sharing data and mounting drives. It was just not a collaborative interface. So every time someone wanted to make a change, we'd have to refresh the notebooks.

[00:14:00] And so we did like using notebooks, and there's a reason why we were using them, but this experience was quite challenging, and the project just took a lot longer than it should have. Our team has really thought about these issues, and we've made great progress towards the outcome depicted on the right here: that is, notebooks should not just be for data scientists; they should be for the whole data team.

And in that world, data teams are less siloed, and data projects are completed from end to end with much less friction, in terms of the pain points we were just discussing. So let me show you some of the improvements we've made to the notebook as we follow our vision of what the modern notebook should be.

And what I would like you to think about is where you think notebooks could be in the future, perhaps as they relate to analytics engineering.

It's clear that we must lower the barrier to entry, right? One of the biggest [00:15:00] barriers traditionally is, frankly, the setup. Newcomers to notebooks should not have to deal with terminals, installing Jupyter and its dependencies, setting up virtual environments, and so on. So our notebooks run in the cloud, they come with pre-installed libraries, they build their own requirements.txt files, and assets of any kind are shared with your team.

So for example, here's a project that was shared with me. You can see that pandas is pre-installed, and when I pip install Great Expectations, Deepnote offers to delete that cell and move the details to a requirements.txt file. Of course, there's a file system, and our notebooks are in the usual .ipynb format.
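For anyone working outside a managed notebook, a rough sketch of the same idea, pinning the current environment into a requirements.txt by hand:

```python
import subprocess

# Capture the packages installed in the current environment so the
# project can be reproduced elsewhere (a DIY version of what a managed
# notebook automates for you).
frozen = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout
with open("requirements.txt", "w") as f:
    f.write(frozen)
print(f"Pinned {len(frozen.splitlines())} packages to requirements.txt")
```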

Oh, and look, my teammate connected our Snowflake warehouse too, and so I can immediately start querying against that as well. Perfect. So that's instant setup. [00:16:00] Another step towards lowering the barrier to entry is no-code. Consider a non-technical manager who has valuable domain knowledge. They're not familiar with Python or the Python visualization ecosystem. No problem: in Deepnote, any data frame can be visualized with just point and click. And this is great. So now that non-technical manager can leverage their domain knowledge without being blocked by not knowing the code.

Let's also break down the Python barrier while we're at it. Let's face it: more people know SQL than Python. A lot more. So SQL is a first-class citizen in our notebooks. Query with SQL, return a pandas DataFrame, and even pass Python variables to your SQL query. So for the SQL-first analysts out there, [00:17:00] perhaps consider the notebook as your new SQL editor, and let us know what you think of that idea in the chat.
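Deepnote handles this pattern natively in its SQL cells; for reference, the equivalent in plain Python looks roughly like this, with a throwaway SQLite database standing in for a real warehouse and the table name invented:

```python
import sqlite3

import pandas as pd

# A throwaway in-memory database so the sketch is self-contained;
# in practice this would be a warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 20.0), (2, "returned", 35.5), (3, "shipped", 12.25)],
)

# The notebook pattern: a Python variable is passed into the SQL query,
# and the result comes back as a pandas DataFrame.
status = "shipped"
df = pd.read_sql("SELECT * FROM orders WHERE status = ?", conn, params=(status,))
print(df)
```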

Notebooks can be chaotic. That's part of their charm. Of course, this comes at a cost when you realize you need to keep track of changes or revert to an earlier state. We keep track of everything for you, so you're safe in that sense. We keep detailed snapshots of your work, as well as the work of your teammates; diffs can be viewed, and you can roll back to earlier versions. Time to break down some more walls with real-time collaboration. Of course, RTC is basically table stakes now for modern interfaces. I mean, look at Google Docs, for example. The modern data experience is a collaborative one, and our notebooks are no exception. Use real-time collaboration to conveniently show a learner the correct [00:18:00] steps, or complete subtasks in parallel with your team. Instant feedback and visibility come as a result.

It's also quite fun and inspiring to see your teammate's cursor bobbing along as you solve problems together. It really feels like a team. And it's a similar story for comments.

If you've used traditional notebooks, you know how difficult it is dealing with shared resources: API keys, environment variables, files, secrets, etc. Well, let's put another nail in the coffin for siloed data teams. Our notebook connects to everything and writes back to everything, and these connections are shared across the team.

So connect something small, like a file, or something big; something structured, like Postgres, or something unstructured, like a data lake or a bucket. Our notebooks essentially function as a hub. [00:19:00]

In order to really be able to work as a team, connection details, keys, and other sensitive information have to be protected. And we take this very seriously, which is why our projects have flexible permission settings, allowing you to control who sees what: for example, control who can edit your warehouse connection and who cannot, or control who can execute a cell and who cannot. The modern notebook experience shouldn't stop short; it needs to cross the finish line, in a manner of speaking. When you're done exploring, prototyping, narrating, and all of the typical things you do in a notebook, you should be able to wrap that notebook up and productionize it. We believe this is a natural and often necessary thing to do.

This is why any notebook of ours can [00:20:00] be turned into an app. Just choose the code you want to show, arrange the cells on a grid, and share with your stakeholders, just like that. And the same goes for scheduling. You shouldn't need to know how to set up Airflow or some complex orchestrator for basic scheduling or pipelines.
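For comparison, a hedged sketch of the DIY route this replaces, executing a notebook on a schedule with papermill plus cron; the file names and parameters are made up:

```python
import datetime

import papermill as pm  # pip install papermill

# Run the notebook top to bottom, injecting parameters, and keep a
# timestamped copy of the executed run for auditing.
stamp = datetime.date.today().isoformat()
pm.execute_notebook(
    "daily_report.ipynb",                # placeholder input notebook
    f"runs/daily_report_{stamp}.ipynb",  # placeholder output path
    parameters={"run_date": stamp},
)
# Then schedule this script with cron, e.g.:
#   0 7 * * *  python run_daily_report.py
```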

[00:20:20] The modern notebook experience #

Allan Campopiano: We can handle many powerful use cases for scheduling directly and simply in the notebook. So, taken together, our team has gotten closer to realizing our vision of the best modern notebook experience. We're lowering the barrier, keeping the code up to date, unifying the data team, securing sensitive information, and creating notebooks that can be productionized. But there's still so much left to do.

And we have so many ideas. Speaking of which, probably most of you use dbt. And [00:21:00] so a natural question is: given how notebooks have changed the data experience for the whole team, how would dbt fit into the modern notebook? We've actually been very curious about this for a while, and some of our users are telling us that this is actually something they need.

So let me share a few quotes with you. Then I'll show you what our team has imagined for a dbt integration.

Bhargava from Axion Ray writes: "While a lot can be done in SQL, many stats and ML models need Python. The customer needs more than just analytics engineering with dbt. They will need prediction, forecasting, scenario analysis."

Michael from Buffer: "Notebooks facilitate fast iteration and exploration, but dbt happens in VS Code and the [00:22:00] terminal. Having dbt in a notebook could reduce friction when iterating on models and downstream Python processing."

Michael from Slido: "Modern notebooks have an amazing story. They could become the ideal platform to keep the whole analytics process together in one place, not only from exploration to data storytelling, but even beyond, to test and build data models through integrations with tools like dbt."

If I do some cursory topic modeling on those quotes, combined with our philosophy around the modern data experience, we can see that users want to reduce the friction between analytics engineering and analytics. So along those lines, let me show you what we've come up with for a dbt integration. To give you an idea of how this [00:23:00] integration works in general, just take a look at this schematic. You can see here that Deepnote is connected to both the warehouse, in this case Snowflake, as well as dbt Cloud, which itself is connected to Snowflake and GitHub. Now, from within our notebook, we can pull in dbt Cloud's metadata, and this can tell us about the freshness of our tables as well as the state of dbt's tests.
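To make that concrete, here is a hedged sketch of pulling similar metadata yourself over dbt Cloud's GraphQL Metadata API; the endpoint and query fields below are illustrative assumptions, so check the current API docs before relying on them:

```python
import os

import requests

# Illustrative endpoint and query shape; verify both against the
# current dbt Cloud Metadata API documentation.
METADATA_URL = "https://metadata.cloud.getdbt.com/graphql"  # assumed endpoint
TOKEN = os.environ["DBT_CLOUD_API_TOKEN"]

QUERY = """
query Models($jobId: Int!) {
  models(jobId: $jobId) {
    uniqueId
    executeCompletedAt      # when the model was last built ("freshness")
    tests { name status }   # pass/fail state of attached tests
  }
}
"""

resp = requests.post(
    METADATA_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": QUERY, "variables": {"jobId": 12345}},  # job id is made up
)
resp.raise_for_status()
for model in resp.json()["data"]["models"]:
    print(model["uniqueId"], model["executeCompletedAt"],
          [t["status"] for t in model["tests"]])
```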

As an analyst, data models undergird my analysis, so I need to have trust in those models as I develop insights into my data. And so we figured: why not surface that information where it's needed, right in the notebook?

Let's take a closer look at this in practice. So here's a bunch of integrations; we would choose dbt Core, and we can simply point the [00:24:00] integration to our dbt repo and add our dbt Cloud API token. In addition, we can even connect automatically to any existing warehouses we might have.

And just like that, our data science notebook is integrated with dbt. Now, there's a lot more we can do at this step, and I will mention some of our ideas. But first, let's check out the metadata that we can see in the notebook. Okay, so here we are in a notebook. This is a SQL cell that's querying projects in our analytics database.

And since we're integrated with dbt, we can see some helpful metadata at the bottom of that cell. For example, we can see that dbt ran 11 minutes ago, and this tells me that I'm querying data based on transformations that occurred fairly recently. As an [00:25:00] analyst, I'm now confident that my analytics are based on pretty fresh data.

In addition, I can see that the tests on the underlying models have passed, again giving me visibility and confidence in the underlying data transformations, since I depend on them. So there's no need to check with the analytics engineer; I can see things for myself right in the notebook. Now, as I said, there's so much more we could do here, such as populating a notebook with all of your dbt models.

We could show you the compiled SQL as you're writing your models and macros, and even provide a way of running your models directly in the notebook. Of course, you can already do that with the integrated terminals that we have, but we have ideas on making this much easier than that. The thing is, there's a lot to think about here, and we need to strike the right balance between what should be in the notebook and what shouldn't.

And we would love to have your feedback. Let us know [00:26:00] in the chat how you envision the intersection between analytics engineering and analytics. And for now, I'll hand it back to Liz to take us through the conclusions.

Elizabeth Dlha: As for the takeaways: my first takeaway is that notebooks just make sense. Despite their shortcomings, they hold a huge amount of potential and can really unlock value for all kinds of roles on a data team. So, for example, to data scientists, notebooks can be a really powerful primary tool for collaborative exploration.

To analysts, notebooks can be a tool that automates their workflows and unlocks their productivity. Especially, as Allan showed us, as notebooks continue to incorporate more SQL-based tools and some modeling functionality, like the integration with dbt, they can even become an analyst's preferred editor. And to engineers, notebooks can be a wonderful interface for improved visibility and improved [00:27:00] exploration.

The second takeaway that I'd like to leave you with is that the role of the notebook in the modern data stack is rapidly evolving. We believe that notebooks can really shift from a tool that is used in a self-contained process, in isolation, to something that becomes a center of gravity in any data-driven organization.

We believe notebooks can become this connective tissue for organizations, where stakeholders can come in and contribute their subject matter expertise throughout the analytical process, and this will ultimately lead to better analytics outcomes. And lastly, we believe that notebooks can truly become the lingua franca, the connection between the analytics engineering and data science worlds.

So to wrap things up: I hope we got you a little bit excited about the evolution in the notebooks category, and also about the possibilities that notebooks can unlock for your teams and for collaboration. So my ask of you today [00:28:00] is: if you are a keen notebooks and dbt user, or even if you don't really use notebooks but are curious, we'd love to hear from you.

We want to really explore this interplay of data exploration and analytics engineering, and we want to go deeper. So if you want to get on board, help us shape this concept, and break things, please let us know. You have our emails here, or just DM us in the dbt Slack.

