One dbt: Accelerate data work with cross-platform dbt Mesh
Dec 19, 2024
Organizations often find it difficult to get a truly complete view of their data when it's spread across multiple platforms. Yet that's the reality for most companies today: some of your data may be stored in AWS, while the rest is split between GCP and on-premises servers.
Despite this, you still need to provide a unified data experience for all of your data producers and consumers, no matter where your data lives. In this article, we'll explore how One dbt provides this by bringing data under a single, unified data control plane, enabling you to build powerful, cross-platform data meshes.
Using the Data Control Plane to implement the Analytics Development Lifecycle
The Analytics Development Lifecycle (ADLC) is a cyclical process that borrows from the Software Development Lifecycle. The ADLC helps organizations work with their data, operationalize it, and discover new types of data with which to repeat the process.
This approach answers questions that your data consumers may have about data origins and process rigor. However, implementing the ADLC is a multidimensional effort. And that's where the data control plane comes into play.
The data control plane isn't a dbt-specific term. It comprises the components that dbt believes are critical to implementing the ADLC and creating a unified experience for all data users within an organization:
- Orchestration
- Observability
- Catalog
- Semantics
- Transformation
dbt calls this the Active Metadata Layer. It's a layer within your data pipeline that sits between your ingestion processes and your analytics, BI, and AI deployments.
We see this layer as critical to data transformation and management. This transformation is basically a declarative statement of how the business thinks about data and how it should be organized to answer the business’s questions. It’s that metadata about your data that becomes valuable. Things like:
- When will this pipeline run and how fresh is the data?
- Where did the data come from and what is its lineage?
- Have the metrics in use been calculated correctly and vetted?
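To make the first of those concrete: freshness expectations can be declared directly in dbt, so this metadata is produced and checked as part of the pipeline itself. Here's a minimal sketch (the source, table, and column names are illustrative):

```yaml
# sources.yml -- a minimal sketch; names are illustrative
sources:
  - name: jaffle_shop
    database: raw
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}   # warn if data is more than 12h old
      error_after: {count: 24, period: hour}  # fail if data is more than 24h old
    tables:
      - name: orders
```

Running `dbt source freshness` then answers "how fresh is the data?" with checked, recorded metadata rather than guesswork.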
Providing a unified experience for all data product users
In the last 15 years, I haven’t seen a single customer in the data and analytics space that has all of their data on a single platform. It’s always in more than one place.
That presents some hard challenges. Getting a comprehensive view of your data that includes the interdependencies between your transactional and your analytics data can be a tough nut to crack.
It's not just the data team itself that works within your data estate. Your data engineers, analysts, and marketing and revenue operations teams may have varying levels of technical skill, but they're domain literate and need to work with the data, too.
Historically, users have come to dbt by a number of paths. By far the most common is open-source dbt Core.
That doesn't mean there aren't teams who used dbt Cloud right from the start. Often, those teams found dbt Cloud more suitable because they wanted managed deployments, or because the teams they work with weren't as technically advanced. Maybe they wanted lineage, catalogs, or metrics so that everyone could work faster and produce more accurate results.
Whatever the case may be, there are many reasons that dbt users might be using Core, Cloud, or even both. If you’re already using dbt Cloud, dbt Core may not seem appealing because all of the benefits of Core are also included in Cloud. Yet some organizations find themselves in a position where it makes sense to use both. One dbt is meant to address precisely that.
Demonstrating effectiveness with a hybrid approach
Yannick Misteli from Roche was in exactly that position. Yannick said that after investing in a data platform, the team at Roche needed to show that it was being used and delivering value, so they started with dbt Core. They were then able to leverage dbt Cloud to scale that impact across the globe, covering more data products within their organization.
This approach allows you to steadily build trust in data and data teams, while still scaling efficiently. Simultaneously, you can leverage that Active Metadata Layer to pass metadata to your AI tooling to put results into an appropriate business context.
This also enables you to ship data products faster. Features like the visual editor and dbt Assist in dbt Cloud considerably lower the ramp-up time for using dbt, making it safe and quick to onboard more contributors, including less technical users.
Lastly, dbt Cloud helps you reduce duplication and reuse assets like jobs, making the most cost-effective use of your compute for data transformations.
Hybrid dbt Core/Cloud in practice
I've typically found that central data teams use dbt Core. At the same time, domain teams closer to the business side, like marketing, sales, and people operations, take those data products and use them on other platforms, such as Google Sheets and Excel. Since these teams weren't using dbt, there was no governed connection between the two.
After implementing a hybrid deployment, those teams are instead given tools to use that data and those data products within the envelope of dbt. That way, the data team maintains visibility into what other teams are doing with the data.
That makes building a multi-step mesh across data domains possible. It also gives domain teams a wider, world-class set of tools to work with, thanks to dbt Cloud.
Merging a sandwich shop and circus
Take the theoretical example of a merger between two businesses, Jaffle Enterprises and Cirque du Jaffle. Although the example is intended to be humorous, the challenges it presents are common in real-world organizations. Whether it's a merger or acquisition, combining two data teams, or a platform migration, environments quickly become complicated.
In this example, the Jaffle shop used dbt Cloud and kept all of its sales data in Redshift to provide data products to its BI applications, notebooks, and machine learning. The circus, on the other hand, stored its concessions, ticket sales, and personnel data in AWS S3 buckets, using dbt Core with Athena and the AWS Glue Data Catalog to provide data products to its BI applications.
In other words, these teams operated differently despite both using dbt.
What does a hybrid architecture actually look like?
The primary use case for a hybrid architecture is to have a foundational dbt Core project, with downstream teams using dbt Cloud projects that are specific to their domain. That gives them access to the visual editor, dbt Explorer, and other useful tooling.
In this way, you can have your domain teams still utilizing those assets from your dbt Core project, but with the niceties of dbt Cloud.
But…what if your dbt Core and dbt Cloud versions don’t match?
That problem is why dbt is introducing the new "Compatible" release track, beginning this month. Each monthly "Compatible" release will match the open-source versions of dbt Core and its adapters at release time. For Enterprise organizations that move a little more slowly and want extra assurance, the "Extended" release track delays this by a month.
dbt enables this approach with dbt Mesh. dbt Mesh is a pattern for collaboration across multiple data projects aligned to business domains. Rather than having one huge dbt project, you can have a collection of smaller, domain-oriented projects.
Each of those projects can build on the others, with guardrails in place via software engineering-like interfaces. Those interfaces are defined using contracts, versioning, and access controls. This means domain teams can maintain control of their data pipelines and reference other teams' projects with confidence that nothing will break.
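As a sketch of what those guardrails look like in a producer project's YAML (the model, column, and version details are illustrative):

```yaml
# models/marts/_models.yml -- a sketch of a governed interface; names are illustrative
models:
  - name: fct_orders
    access: public            # other projects are allowed to ref() this model
    latest_version: 2         # consumers can pin to an older version while migrating
    config:
      contract:
        enforced: true        # column names and data types are checked at build time
    columns:
      - name: order_id
        data_type: integer
      - name: order_total
        data_type: numeric
    versions:
      - v: 2
      - v: 1
```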
Using dbt Mesh, you can integrate dbt Core and dbt Cloud projects as follows:
- First, prepare your core project for access through dbt Mesh
- Second, mirror each dbt Core “producer” project into dbt Cloud
- Lastly, create and connect your downstream projects to your dbt Core project using dbt Mesh
In the first step, you leverage dbt Mesh to configure your public models to serve as interfaces for your downstream projects. After that, mirroring each project in dbt Cloud enables you to connect those to the dbt Core project in dbt Mesh, ensuring that changes in Core are inherited in dbt Cloud as part of your mesh architecture. That's pretty much it!
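Concretely, the wiring in a downstream dbt Cloud project is small. A minimal sketch, using hypothetical project and model names:

```yaml
# dependencies.yml in the downstream dbt Cloud project
projects:
  - name: jaffle_core   # the mirrored dbt Core producer project
```

```sql
-- models/finance/orders_enriched.sql
-- the two-argument ref() resolves across project boundaries
select *
from {{ ref('jaffle_core', 'fct_orders') }}
```

Because the cross-project ref resolves through dbt Mesh, the consumer never hard-codes the producer's schema or platform details.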
Iceberg and dbt Mesh for full multi-platform support
If dbt doesn’t support a platform well, then users of the platform can’t adopt dbt. It’s in everyone’s best interest to continue adding support for more platforms and empower users.
It's for precisely that reason that dbt Labs added support for Synapse, Fabric, and Teradata. The new Amazon Athena adapter arrived for the same reason: customers like Moderna had built their entire analytics stack on AWS, with Athena as a key component.
Alongside Athena, Moderna also used Redshift and Iceberg format tables in S3. Apache Iceberg is an open table format standard for storing data and accessing metadata—this means that it’s agnostic to data platforms and compute engines, so you have more flexibility in how and where you access it.
Given these advantages, Iceberg is a hot topic in the data community. That’s why dbt now supports it. All it takes is a single line of code to materialize your dbt models in Iceberg format.
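The exact configuration key varies by adapter. As a sketch using the dbt-athena adapter (the model name and its upstream ref are illustrative; Snowflake uses a similar flag, table_format):

```sql
-- models/marts/fct_ticket_sales.sql -- illustrative model name
{{ config(materialized='table', table_type='iceberg') }}

select * from {{ ref('stg_ticket_sales') }}
```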
So how does Iceberg fit in with dbt Mesh?
dbt customers that need to integrate the data estates of multiple companies have often found that, to share data between their dbt projects, they had to consolidate onto one platform or replicate that data. Both options are costly and time-intensive.
By supporting Iceberg in dbt Mesh, you can now make cross-platform references to the same data in a common format, without the need for replication or movement. Simply changing the table type in your SQL configuration and changing the access configuration on the model is enough to make this data usable within queries in your other projects.
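Continuing the hypothetical circus example, the producer's side of that change is small: the Iceberg config shown above in the model's SQL, plus opening up access in its YAML:

```yaml
# the producer project's model YAML -- names are illustrative
models:
  - name: fct_ticket_sales
    access: public   # downstream projects on other platforms can now ref() it
```

A downstream project on another platform can then reference the model with the same two-argument ref() shown earlier.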
Conclusion
As organizations grow and change, their data landscapes become large and diverse. Capabilities like Iceberg and cross-platform data mesh can help you scale the impact of your data operations.
Support for Iceberg is just getting started, too, with more integrations for things like catalogs on the horizon.
Watch our on-demand webinar One dbt: Accelerate data work with hybrid deployments and cross-platform dbt Mesh to learn more about scaling your data operations in a cross-platform way.