
Building a metadata ecosystem with dbt

Oct 29, 2021


dbt has democratised the role of ‘Analytics Engineering’. It has empowered a new wave of ‘data practitioners’ across teams within the Data Mesh.

The Data Mesh is an architectural paradigm that decentralises data ownership into product teams rather than a single centralised data team. Combining dbt and the Data Mesh creates an explosion of dbt users, and with it many organisational challenges, such as:

  • How do you maintain high dbt code quality across teams?
  • Who owns what data?
  • If data breaks in production, which team is responsible for fixing it?
  • How do you discover data built by other teams and leverage it within your own development?

Follow along in the slides here.

Full transcript #

[00:00:00] Kyle Coapman: Hey there, dbt community. Welcome to another session of Coalesce. My name is Kyle and I'm the head of training over at dbt Labs. And I'm super excited to be your host for this session. We're going to get meta with a session called Building a Metadata Ecosystem with dbt. Our wonderful speaker is Darren Haken, the head of engineering for platform and data at Auto Trader. He starts his day with coffee and looks forward to driving his new electric vehicle, a Volkswagen ID.3, and in a past life was a DJ in Manchester. He's a huge fan of Japan and has visited five times.

Just so you know, all chat conversations are taking place over in the Coalesce Metadata Ecosystem channel of dbt Slack. If you're not part of that chat, head over to community.getdbt.com in your browser, and then find the channel Coalesce Metadata Ecosystem. We encourage you to ask attendees and the speaker questions, make comments, react, drop memes, all sorts of things, especially about metadata, the hot topic right now. We'd love to get the chat going. If you have [00:01:00] a question, please just put the red question mark emoji at the beginning of your comment so that I can make sure I bring it up later if we have time for live questions. At the end of the talk, any questions that haven't been answered live will be answered directly in Slack. And without further ado, let's get meta with Darren. Over to you.

[00:01:19] Darren Haken: Thanks for that, Kyle. Yeah, I truly mean it. If you do like my talk, I am definitely happy to accept a Shiba, I’ve wanted one for many years. I just need the perfect opportunity. So yeah. Hi, I’m Darren, head of engineering at Auto Trader.

I work in the platform data space. My role really is to build capabilities across our platform that help the rest of the organization. The big areas that I focus on are data and analytics. How do we democratize data? How do we build tooling and capabilities around machine learning and analytics, the whole raft of things that help the rest of our product teams, marketers, executives, whoever, to work with data?

Today's agenda is the structure I want to go through around this metadata [00:02:00] topic that I'm very passionate and very interested in at the moment, especially with the platform lens that I bring to my day to day. I'm going to start by talking about Auto Trader's journey to this decentralization.

And this is the context or the groundwork that explains why this metadata topic is so interesting, before I talk about the hero of the piece, the metadata itself. And I'm going to finish off a little bit around, if you're interested in metadata, which I hope you are after this talk, how we as a community can take that forward and move it forward as an industry.

[00:02:29] Auto Trader’s data journey #

[00:02:29] Darren Haken: So let's start with a bit about Auto Trader. For those that don't know who Auto Trader are, we're the largest automotive marketplace in the UK, established in 1977. So very old. In fact, I took a screenshot on the right here; back in 1977 we actually used to be called Thames Valley Trader.

[00:02:49] Darren Haken: It looks very old. I was actually surprised not to see horse carriages on there, it looks that old. It was an old kind of magazine you could buy in the Thames Valley area, which is a part of the UK, [00:03:00] and use to buy a car. And to the right of that is the modern take on that, which is a fully digital organization that helps people buy and sell cars online.

And in terms of scale, which is always helpful for this: we have about 58 million cross-platform visits, and we're roughly around the 12th largest site in terms of traffic here in the UK. So a pretty reasonable amount of data and lots of activity. Generally, people love to look for cars or buy cars.

It's a very key thing for a lot of families. And the part that's interesting about this is that 1977 is a very long time ago. That means Auto Trader has had many stages in its data journey. These aren't accurate timescales; they're just trying to show a sort of growth or movement or transition.

But at some point, when computers came along, God knows when, I imagine at first we actually worked with paperwork. The dawn of analytics, as far as I'm concerned, from a digital perspective, was when we had people working with spreadsheets on [00:04:00] some level, trying to do analytics about the marketplace and helping us make sense of what we needed to do, some period around sort of 15 years ago.

Then we started building out a centralized data warehouse and building reporting on top of that. It was a very classic period of analytics for that time. I'm sure if some of you have been working on analytics since that period, you'll fully know it. Moving further into more modern times, we started looking at data lakes, machine learning, how we do more predictive components, which has then led us into this more recent journey where I've been a lot more involved in the last few years. And that's building out what we call our self-serve data platform, which to me is building capabilities that are self-served by the rest of our organization.

And then recently one of the bigger focuses over the last sort of 12, maybe 18 months is more of a move to decentralization with how we work with teams in our platform on data. And we’re really focusing on this data mesh [00:05:00] architecture that’s been discussed quite a lot in industry, which hopefully some of you may have heard of.

[00:05:06] Self-serve data platform #

[00:05:06] Darren Haken: So in terms of this self-serve data platform, it's worth giving a bit of an explanation of what it is, in my opinion. One of the key areas some of my teams work on is building what we call our data platform. And that is building capabilities that let product teams and marketeers, like I said earlier, work with data. And we do that by building tooling or making it easier to work with certain things.

As an example, we use a technology called Apache Spark. It's very complex to run that from zero, from "I want to run Spark." So what we do is provide tooling to make it really easy to do that. And another area that we focus on is what I'm going to describe as our practitioners.

So practitioners are people who are maybe data engineers, software engineers, analytics engineers, data scientists; I want to bundle that under practitioner. One of the technologies that we've fallen in love with across that kind of cohort is dbt. And crucially, from a self-serve data [00:06:00] platform perspective, we've invested in trying to make it available for everybody to work with and make it as easy as possible.

And through that, we've actually seen this explosion in data practitioners. So by trying to build this self-serve data platform and invest in dbt, we've actually seen this huge shift, an explosion of users across the organization. I'm going to back that up with a few illustrations. We started working with dbt around 2019.

Actually, I was tinkering with it and looking at it, and we were exploring it as an alternative solution to working with Apache Spark, with the premise that it was easier for a wider set of practitioners to work with. And that's been pretty good. We've moved into a place now where we're probably more in the region of 70.

I put this together a few months ago, but we've got about 60, 70 active practitioners. And this is across an engineering group of 200 people, within an organization of about 1,000. And [00:07:00] the reason I can pull this up, by the way, is that at the moment we use dbt in a monolithic approach.

So we've got one central repository. So obviously I can go to GitHub like this, and I can see contributors on master. And that's pretty awesome. We've now gone from four people-ish to 60, 70. That's great. That's a nice explosion; that's a vibrant community. Another lens we can see that through is Slack: we've got this sort of review process where if someone's doing dbt work, they should get somebody else to review it.

So if I look in Slack, which I did a few months back when I was putting this talk together, there's a bunch of people who were like, hey, I've done some dbt work, can somebody have a look at this for me? And that's awesome. A little bit more worrying, in terms of the explosion of users, is these 79 active pull requests.

That was another lens, I thought: if we've got active contributors, [00:08:00] and the way that people do things is pull requests, let's have a look at how many we've got. So 79 is actually a little bit more worrying, because even though it shows an explosion of users, it's probably also showing we've got a bit of a backlog building up here.

[00:08:13] dbt + decentralization #

[00:08:13] Darren Haken: We're not actually getting through them. So I'm a little bit more alarmed at this one, but nonetheless it shows an explosion of users. To bring what I was describing as decentralization together with dbt, we've tried to organize the way we work with data around business domains. An example of that would be: we've got this marketing domain, and inside it we've got groupings of data that represent things like campaign performance.

This might be social campaigns or advertising campaigns and looking at the performance of them; maybe we're tracking a bunch of data around SEO, and so we're actually trying to build out better data modeling to look at SEO optimizations and things we can do with that. On the other side, we might have, I don't know, a sales domain, and in there we're [00:09:00] looking at sales opportunities or churning customers and that kind of thing.

And with that, we have a bunch of people working in these areas who are actively developing in dbt. So within the sales domain, we've maybe got five, ten, whatever number of people actively developing dbt data models to do things like risk and sales opportunities. And on the marketing side, we may have a similar number of people doing things like campaign performance and so on.

So, an explosion of users, lots of activity with dbt. Brilliant, amazing. However, distributing our data across the organization hasn't just been the perfect situation we expected. We've seen a set of new challenges come from it, which is what I'm going to get into now: the challenges.

[00:09:50] A new set of challenges #

[00:09:50] Darren Haken: So one of the challenges we're seeing with this decentralization is this problem of ownership, or maintainers, of dbt models. We've got this explosion of [00:10:00] people now working across all these domains, and they're actively working, but we've experienced this problem where people work on things, then stop working on them and forget about them.

And the data platform team, one of my teams, did an audit of all the dbt models we had in our monorepo, looking for, I don't know, people who would wave their hand and say, I have my eye on that, that's something I'm responsible for. And 20% of our models had no sort of owner or maintainer. And this isn't a dig at dbt specifically, because we actually looked at other tools.

We use Airflow, Spark and other things, and this is just a common occurrence, a problem spread across all of our technology. And it's actually more this decentralization, I think, that's caused it. More alarmingly, then, what we discovered is that we use Looker to drive and access some of our data.

We build in dbt, and we found that Looker has 74 references to BigQuery tables (we use BigQuery) that were connected to these unowned dbt models. And that's a little bit [00:11:00] concerning, because people who use Looker might not really be as connected with the dbt work. Maybe it's our CFO or CMO, or it could be somebody in our accounts and sales team who's accessing data.

And their view is: this is well-maintained and reliable data, right? Because it's available to look at. But that might not be true, certainly not with 20% of models unowned and 74 references. So, ownership problems. I've covered that, but just to make sure I've not missed anything: we start asking questions, especially in my platform teams, who want to look at quality and stability here. Who are the maintainers of these models? Who is it? We can't find them. And that cascades into: which stakeholders depend on the data? And the bit where I think this is really important as well is, if we have a bug or some sort of production incident, or something goes wrong with that data, who is there to fix it? Who resolves it? If a model is broken, who fixes it?

So this is a problem that we started to see when we saw this explosion of users. [00:12:00] And then there's this other thing, which I struggled to give a title to, but hopefully it makes sense: this idea of sharing data without permission.

It's really easy, I think, in dbt, especially when I'm in one repo at the moment, to access and reference data: you've got the ref function, I can look something up, and I can access the data. But that means it's very easy cross-team at the moment for team A to access data from team B.

So I see it as teams depending on other teams' data without permission, because this ref hasn't necessarily been agreed with the other team. And what we're seeing here is a lack of contracts. Team A can access team B's data, and team B doesn't actually know that; team A can just reference it, blissfully unaware.
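To make that concrete, here is a minimal, hypothetical sketch of what such a silent cross-team dependency looks like in a dbt model. The model and column names are illustrative, not Auto Trader's actual models:

```sql
-- Hypothetical model owned by team A, quietly depending on team B's
-- sessions model via ref(). Nothing in dbt requires team B to approve,
-- or even know about, this dependency.
select
    session_id,
    vehicle_type,
    count(*) as session_count
from {{ ref('sessions') }}  -- resolves to team B's model in the shared repo
group by 1, 2
```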

And we've seen issues start to manifest from that. So we had sessions data, user tracking data, and we've got some sessions table. Maybe a couple of our different teams [00:13:00] have got tables they call sessions, and they're for different parts of our business, right?

So maybe brand new cars, leasing cars, used cars, and they want to look at sessions around these sorts of vehicles, but they all call the table sessions. We found that we ended up with this multiple-sources-of-truth problem, where different teams were referencing different sessions tables because they thought that was the right sessions table.

Then we have to backtrack on some of that, and that creates this sort of ripple, right? It hits trust and confidence. People want to validate: am I accessing the right sessions table here? They want that validation because they don't trust the data.

They're asking that question just to make sure. And part of that is that it feels expensive for us when we want to restrict access to data. We've got a very open setup at the moment, it's very easy to access the data, but sometimes we actually don't want that, maybe for GDPR reasons, or personal data, or some sort of intellectual property, whatever it is, where we want to restrict access. The blocking off of data areas at the moment feels more expensive for us.

[00:14:01] Darren Haken: I've got an example of this. Don't worry, you don't need to read all of the slide. I'm not sure if any of you can actually read all of the boxes, but visualizations are helpful. Nonetheless, in this example, I've got this top node, and that's one of our teams, which we call the stock and search platform team.

They had a bunch of dbt; they've got into the world of dbt modeling and built out a bunch of stuff. And actually they've got data that a lot of teams care about; they've got really high-value data. And so all these other teams underneath referenced that data, because they want to work with it to do what they need to do in their area of the organization.

But the stock and search team didn't know that all these people were referencing their data through the ref function. And so when we actually had a data incident, a production incident, where there was a problem with that stock and search platform's data, it cascaded. The blast radius was quite significant, and all these different teams were affected.

So yeah, don't worry, don't read all of the boxes and everything; it just really illustrates this blast radius idea. The [00:15:00] data was no longer encapsulated. It was leaking out into different teams.

[00:15:05] Discoverability was becoming a challenge #

[00:15:05] Darren Haken: And we've got this other problem that started to happen with this explosion of users, where discoverability, finding the right data, is getting harder and harder.

We've got people moving to Slack, and it's almost this sort of tribalism: hey, does anyone know where I can find this data? I'm looking for X. Can people help me? That's why I describe it as tribalism, because it's about what the people in the community know.

And that has some upper limit on scale, because eventually no one can remember all the data, at least not in an organization of the size where I work. So it's a good and a bad thing, right? People are obviously interested in data and want to access it, but at the same time, if it was more discoverable and more intuitive, then maybe they wouldn't have to use Slack to ask these questions. So we've got these challenges.

They've come through this decentralization and explosion of users, which are still fantastic problems to have, but I feel like we need a scalable way to overcome these [00:16:00] challenges exposed by decentralization.

[00:16:04] Automation over human processes #

[00:16:04] Darren Haken: And an overriding principle for me, an overarching principle, is that I'd like to approach this with automation. It would be very easy to introduce governance and human processes and say, let's go harder on reviews, or let's hire more people to attack all of the active pull requests. A lot of my background is software engineering, and I've learned through that that human processes are either error prone or just cannot keep up with the scale, right?

So we need automation, ways of validating things, to keep us moving at scale. And for me, one of the most interesting ways we could solve these challenges is through this idea of metadata.

[00:16:47] dbt supports metadata #

[00:16:47] Darren Haken: So some really awesome news: dbt supports metadata. It's part of it, it supports it, it's brilliant. The idea is that in your schema .yml files, your definitions of your models, there is this [00:17:00] meta block that sits underneath the model. You can see it on the right hand side of this slide, and inside, it can contain key-value pairs. So for example, and I think this is from the actual dbt docs, or it was at some point, we've got owner, which is set to @alice, and there's this other one, which is model maturity.
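For reference, a meta block along those lines sits in the model's .yml definition roughly like this (mirroring the example in the dbt docs; the keys are arbitrary, user-defined key-value pairs):

```yaml
# schema.yml: a model definition carrying arbitrary metadata in its meta block
version: 2

models:
  - name: users
    meta:
      owner: "@alice"
      model_maturity: in dev
      contains_pii: true
```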

And so we can encapsulate the model development and the metadata together. We've got metadata as code, which, as I'm sure all of you know, is great, and it's all version controlled. So that's brilliant. And because it's all code, we can actually integrate our metadata and use it in our CI/CD pipeline and the way we deploy.

And this is just a CI/CD pipeline where code gets captured in GitHub and dbt does a compilation-like build stage. In there, I've got this green circle, which is where I'm saying we can build a hook to apply what I'm describing as automated policies on the metadata, and I'll get into what [00:18:00] I mean by that next.

We then use Airflow at the moment to schedule dbt on a regular basis, which writes data into Google BigQuery. And one of the other hooks, which I'll touch on, is where we can synchronize this metadata to the broader ecosystem.

[00:18:16] Platform governance #

[00:18:16] Darren Haken: So, automating policies. One of the models we follow at Auto Trader is that we group metadata together using something like a namespace or a prefix. So we've got this data governance namespace, and anything related to data governance is grouped together under it. We've come up with all these different attributes, like business owner, team owner, and that kind of thing. And the great thing about this is we can create policies.

So in that build stage, the hook in CI/CD I described earlier, we've written some Python modules that inspect the dbt manifest and look at what values exist [00:19:00] in the metadata for all the models. And that lets us define policies. We can come up with any policies we want, but examples could be: we want to make sure that all of our models have an owner.

And if they don't have an owner, we fail the build. Maybe we create some sort of lifecycle flag and say only production models can be used by other teams, so not if it's a development one or it's in testing. Or maybe we use this PII flag and say only the security team can access that data model. And we can apply all of that automatically through the policy.

We've done this with inspiration from linting platforms; if anyone's used one, then you'll get the idea. We've got these rules, and we've built rules. And then on the bottom right hand side is a little box showing a legitimate build failure that happened in CI/CD.

No metadata existed for one of the models, and so it failed to build, and that can't go to production anymore, which is amazing.
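As a rough illustration, a build-stage policy check like the one described might look something like the sketch below. This is a minimal sketch, not Auto Trader's actual code: the owner key and messages are assumptions, and depending on dbt version the meta block can sit at the node's top level or under its config.

```python
import json
import sys

MANIFEST_PATH = "target/manifest.json"  # artifact produced by dbt's compile/build

def get_meta(node: dict) -> dict:
    """Return a node's meta block; dbt versions differ on where it lives."""
    return node.get("meta") or node.get("config", {}).get("meta") or {}

def check_owner_policy(manifest_path: str) -> int:
    """Return non-zero (failing the CI build) if any model lacks an owner."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    failures = [
        node_id
        for node_id, node in manifest["nodes"].items()
        if node.get("resource_type") == "model"
        and not get_meta(node).get("owner")  # assumed metadata key
    ]
    for node_id in failures:
        print(f"POLICY FAILURE: {node_id} has no owner in its meta block")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check_owner_policy(MANIFEST_PATH))
```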

And then on the ecosystem side: so we've covered the automated policies in the build, and on the right I have that other green [00:20:00] blob, which was synchronizing the metadata.

I've exploded that out, and what I'm really saying here is that you can take that metadata and push it out to the other technologies you're using in your modern data stack, your ecosystem. For us, that's a metadata solution slash data catalog called DataHub, which we're looking at, and a monitoring and data observability tool called Monte Carlo, which we've been using for quite some time.

A couple of examples of how we use the metadata in that way. One of the other namespace blocks that we've used is data observability. So anything related to data observability, we're bundling together under a namespace of data observability. The only one that's probably worth looking at here, without going into too much detail about Monte Carlo, is this notification channel.

The idea of that is that the team building the dbt models can say: this notification channel is a Slack channel, and if there's a problem with my data model, I want you to send the [00:21:00] alert in there and tell me there's a problem with my model. And the packages we've written in Python in the CI/CD hook into that and then synchronize it into Monte Carlo.

And this is an example of that. Monte Carlo is on the left. Essentially, we've grabbed that metadata, we've posted it to the Monte Carlo API, and it's registered there. And then when there's a problem, on the right hand side, you get this Slack alert to say: there's a problem with your data model in production, it's gone wrong, and you need to look at it.
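As a hedged sketch of what that synchronization hook could look like, the step below reads each model's notification channel from the manifest and registers it with the monitoring tool. The endpoint, payload shape, and metadata key here are hypothetical stand-ins, not Monte Carlo's real API:

```python
import json
import os

import requests  # third-party HTTP client: pip install requests

OBSERVABILITY_API = "https://api.example.com/v1/notification-channels"  # hypothetical

def sync_notification_channels(manifest_path: str = "target/manifest.json") -> None:
    """Register each model's Slack alert channel with the observability tool."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        meta = node.get("meta") or node.get("config", {}).get("meta") or {}
        channel = meta.get("notification_channel")  # assumed metadata key
        if not channel:
            continue
        # Post the mapping so production alerts route to the owning team's channel.
        requests.post(
            OBSERVABILITY_API,
            headers={"Authorization": f"Bearer {os.environ['OBSERVABILITY_TOKEN']}"},
            json={"table": node["name"], "slack_channel": channel},
            timeout=30,
        ).raise_for_status()

if __name__ == "__main__":
    sync_notification_channels()
```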

And we've driven all of that through automation, synchronizing it into the ecosystem. The other one we're evaluating at the moment is trying to tackle discovery, and that's through a kind of central metadata store. So we're looking at a tool called DataHub, which is an open source metadata store.

And the great thing about these is they can bring metadata together from all sorts of services. It makes them searchable, and hopefully it reduces tribal knowledge and reduces the Slack queries. On the right, we've actually been pulling all the metadata fields into it. [00:22:00] And they're already rolling out new features.

It's worth a look, but we've also started seeing that they've grabbed this meta block and they're pushing it into all the different components, like owners and tags and everything. So it's really exciting, and it makes it fully searchable. I'm running out of time, so I'm going to get to the end part. Hopefully by this point, you've seen the problems decentralization, with lots of people working with dbt at scale, can cause.

And hopefully you can also see that metadata might be one of the ways we can tackle some of this and keep that explosion of users, because we want that, while also keeping the stability you get when you've got fewer people working on a codebase.

[00:22:39] We need centralized metadata #

[00:22:39] Darren Haken: So, where do we go from here if you want to start using metadata? One of the things I think we need as an industry is a centralized metadata store. Lots of tools attempt it (dbt, for example; we've got metadata in dbt), but I'm imagining you're in a similar place to me, where you use lots of [00:23:00] data tools around dbt, adjacent technologies: maybe a BI tool like Looker, a data warehouse, I don't know, Fivetran to ingest data, whatever. This diagram here is actually from lakeFS, a company who built open source file storage, and they do this landscape view every year. You can see there are absolutely loads of technologies. I use some of these, you'll use different ones, but the point of all of this is why we need to centralize the metadata: so we can take all of these tools and bring it into a central hub.

[00:23:29] Darren Haken: In my opinion, there are a few worth looking at out there. So DataHub, which I mentioned we're reviewing; Amundsen, which is another one that's quite interesting; and we've got Marquez. There are loads of them popping up now. These centralized metadata stores are not really a solved problem yet, but it's very exciting.

And I know that dbt is also looking at it; I've seen a few things coming up as well. So that's super exciting. One of the other things I think we need to look at is standardized metadata. These ideas of ownership and [00:24:00] governance and observability that we've come up with at Auto Trader, I don't feel like they're unique, right?

These concepts must exist across organizations. And this is why I think we need to start exploring standard ways of documenting ownership, for example. That's super important, right? Because then the ecosystem of metadata can thrive, because lots of tools can say: as long as I understand what this standardized metadata for ownership is, I can use it within my software, my tool.

Without that, we're going to end up with 500 different ways of defining ownership, and it will be really expensive to integrate them all together and have an ecosystem. There are some emerging standards out there for this: we've got OpenMetadata and OpenLineage. They're worth looking at, and they're making attempts now at trying to build these standards.

And that, I think, is basically what we need. It's just really exciting, so go and take a look at these projects. Really, the reason I put this talk together, and this is the final part for me, is [00:25:00] that for me this is a call to arms. This metadata thing is super exciting. The only problem is that there aren't any standards out there, and there are no incumbent metadata solutions yet.

[00:25:14] A call to arms #

[00:25:14] Darren Haken: But it is really important. And I think this is where it's a call to arms for me. If you think it's important, or you believe this metadata concept is a good idea, then I think the way we can take it forward is through open source, right? If we get together, especially the dbt community, there's absolutely loads of people using dbt, we can all work together to drive these standards. Maybe we contribute to DataHub, or maybe we contribute to Amundsen, or maybe we contribute to OpenLineage, but whatever it is, I'm just trying to spur people on to think about how you could contribute.

So, takeaways from the talk: distributed teams, lots of people working with dbt across lots of teams in a decentralized fashion, can introduce a lot of the challenges I've seen, and I'm sure they're not unique to me. [00:26:00] Metadata is a really powerful tool that we can use to automate away some of these problems.

And I also think there are missing components in industry around metadata in the modern data stack, and we can solve them together by contributing. So that's what I'd have you do straight after this talk, but also buy me a Shiba. If you want to know more about Auto Trader, where I work, we write lots of blogs and everything.

We try to get out there on the engineering blog, so go and take a look. And if you're in the UK, or thinking about moving to the UK, and you're interested in a role with us, then go to the Auto Trader careers page, or feel free to message me. We're always looking for people that are passionate. And that's all. Yeah, I'm open for any questions. So Kyle, if you can jump on now? Okay.

[00:26:46] Kyle Coapman: Yeah. Darren, thank you so much. I really appreciated how you mentioned pain points throughout and how you addressed them. I have one question I'll raise, from a different Kyle, Kyle Shannon. His question is: who on what teams was involved in defining the structure of the [00:27:00] metadata you use organization-wide, and how did those conversations go?

[00:27:04] Darren Haken: Okay, yeah. So this is actually a really good question; I've not been asked this before when I've talked about this. So we got the teams together, actually. A lot of it's been driven through the challenges that we witnessed.

We'd find a problem, and then we'd look at what metadata we needed to tackle it. This was one of the issues I found, and it's why I think standards are important. We started looking at ownership first, and I thought surely there's going to be a bunch of people out there who've said, here's some really good metadata you'd want to capture about ownership. And there wasn't, actually, although I spoke to passionate people in the space who shared what they'd learned in their organizations. We got started from there, but what we did is bring working groups together, like a special interest group or a guild around metadata in the org.

Then we pulled dbt practitioners from, say, three or four teams and said, hey, we need to move this forward, and built it out that way. But it's been evolving. We started as limited as we could, and then widened the metadata.

Last modified on: Oct 21, 2024
