At Coalesce 2024, we announced that we are building a new, multi-platform data governance and sharing capability in dbt Cloud called cross-platform dbt Mesh. This is a natural evolution of dbt Mesh: whereas dbt Mesh originally supported references within a single data platform, cross-platform dbt Mesh allows you to coordinate dbt across data platforms. This capability is made possible by the rapid adoption of Apache Iceberg across the ecosystem, and pairs with our launch of Iceberg support in dbt. Our team is actively working with design partners to develop this capability, and we're iterating toward a beta that will include support for Athena, Databricks, Redshift, and Snowflake. As we move toward GA, we will continue to add support for any Iceberg-compatible platform.
We're excited to build on the momentum behind dbt Mesh as the means for managing data complexity at scale, and to empower teams across enterprises to collaborate on data, regardless of which data platform they choose.
But, before we talk about the what and how of cross-platform dbt Mesh, I want to share a reflection on how we got here, and where we're going.
The need for multi-platform flexibility
dbt adoption and data complexity have exploded over the past eight years. In the first few years, we considered a project with a few hundred models to be very complex; today, 5% of dbt projects (>2,000 of them) have over 5,000 models. Over the same time horizon, we've observed analytics development becoming more like software development: as the dbt workflow drives more value, more people are engaging with it than ever, more assets are under management than ever, and the systems we are building encapsulate more complexity than ever. Practitioners need better approaches and tools to tame that complexity, because, just like in software engineering, building and managing complex systems is hard. We also think it's one of the most interesting and important challenges in data today.
There is one dimension of complexity that we have not addressed until now. In the early days of the modern data stack, we envisioned organizations choosing a single best-in-class tool for each step in the process: one tool for extraction, one for warehousing, one for transformation, one for BI, and so on. That is not the reality we see in most enterprises today: data stacks are less like a single thread and more like a patchwork quilt. Individual buyers choose the tools that suit their teams, and in an enterprise context that means many buyers purchasing on behalf of many teams. It’s become clear that the question leaders are asking isn’t: “Will we adopt platform A or B?” Instead, it is: “Our organization has clear use cases for platforms A and B. How can we embrace both and still foster governed collaboration, data velocity, and data trust?”
To make this concrete: half of enterprises that use dbt Cloud are working across multiple data platforms. This would have surprised me in 2016, but over the last few years I have come to believe that it's a wonderful thing. Practitioners love the tools that they love, and teams purchase the tools that suit their needs. From there, at the organization level, the tools and workflow should stitch these various systems together into a seamless experience. It's just like software engineering: I don't care whether you use emacs or vim, the git CLI or a GUI. As long as you are a good citizen of the software engineering workflow (you get your code reviewed, you follow the style guide, and you don't break anything), teams happily coexist even when using very different tools.
Today, working within a single data platform works well, even with multiple projects. But when it comes to working across platforms, the seams in the quilt are more like chasms. Practitioners in different business units aren't able to discover or reuse the contents of the "other" platform; architects aren't able to govern how data is exchanged. Work ends up happening in silos, and we hear things like, "I have no idea what the other team is doing – they don’t use the same platform that we use."
At best, we see practitioners deploying duct tape, glue, and Python notebooks to hold these stacks together (if they are held together at all). That may be necessary today, but it looks to me like a step backward. We should aspire to avoid using "hacky" or heavyweight tools to discover and share data, unless we really need them. These breaks across tool boundaries lead to breaks in the Analytics Development Lifecycle (ADLC). We aspire to treat analytic code as an asset, rather than making transactional requests for data ("can you re-run your notebook to refresh the data in my system?"). We aspire to empower practitioners to "put on the hat" and participate in the process, regardless of which tools exist in their organizational ecosystem. We aspire to empower administrators to govern data and the process by which it is produced. And, we aspire to enable discovery and re-use and avoid duplicative work.
If we could make data discoverable and governable across these data platforms, then we'd get the power of the workflow across an enterprise, rather than just within a team or business unit.
It's time to stitch these seams between data stacks together across data platforms.
Introducing cross-platform dbt Mesh
dbt Mesh lets you break down a monolithic application into constituent parts, govern how different datasets can be used downstream, and discover lineage across multiple projects. But it breaks down at the data platform boundary, which is also frequently a team boundary. So we've asked our customers: what if getting unified data workflows were as easy as applying dbt Mesh across projects sitting on different platforms? If a cross-project ref "just worked" across those projects, regardless of platform? Is that something you would want?
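For context, this is what a cross-project ref in dbt Mesh looks like today (the project and model names here are hypothetical):

```sql
-- models/finance_summary.sql in the downstream project, which lists
-- the upstream project in its dependencies.yml. The two-argument ref
-- names the upstream project and one of its public models.
select *
from {{ ref('orders_platform', 'fct_orders') }}
```

Cross-platform dbt Mesh keeps exactly this syntax; the difference is that the two projects may run on different data platforms.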
When we talk to enterprises using dbt, the answer we resoundingly get is “YES.” More than a simple “yes,” it’s almost always a gigantic-sigh-of-relief-yes. So, that's exactly what we are building.
Today, we’re excited to announce that we are adding cross-platform ref support to dbt Mesh.
"Cross-platform dbt Mesh makes the promise of data mesh an actual reality for us. Now, it will be possible to work on an organization-wide data model—one that all teams can contribute to and consume from—regardless of what that team's tech stack looks like. Cross-platform dbt Mesh gives the technology diversity of our data ecosystem a common denominator that we can all build around."
- Ulrik Svanborg Møller, Lead Data Engineer, Vestas Wind Systems
How does cross-platform dbt Mesh work?
Cross-platform mesh leverages open table formats (in particular, Apache Iceberg) to interchange data. Today, Iceberg support exists but is limited: each platform supports only a subset of the Iceberg spec. That support is becoming more complete and widespread with each passing quarter. The trend is rapid movement toward complete support, and at dbt Labs we're putting significant effort into participating in this evolution. The end result of open table format support will be that data platforms can seamlessly interchange data, at least at the edges. And that seamless interchange is what enables cross-platform mesh.
Adopting Iceberg is a prerequisite to using cross-platform mesh. That could be an intimidating prospect. For now, just know that this does not require migrating every single table you manage to an Iceberg catalog. You do need access to an Iceberg catalog where you can "stage" public models so that they can be referenced by downstream projects. Once you have that set up, you can "share" public models across warehouses without copying data.
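As a sketch, staging a public model into Iceberg might look like this (the names are hypothetical, and the exact config keys vary by adapter; this mirrors dbt's Snowflake Iceberg support):

```yaml
# models/schema.yml in the upstream project. The model is made public
# via dbt Mesh and materialized in Iceberg format so that downstream
# projects on other platforms can read it from the shared catalog.
models:
  - name: fct_orders
    access: public
    config:
      materialized: table
      table_format: iceberg              # write as an Iceberg table
      external_volume: iceberg_staging   # hypothetical storage location
```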
"By truly separating storage and compute, the technological barriers that contribute to siloed thinking will be eradicated. With cross-platform dbt Mesh, self-service business users can participate in the data workflow, without having to worry about managing refreshes or data synchronization. This levels the playing field so more people can generate value from our organizational information."
- Ulrik Svanborg Møller, Lead Data Engineer, Vestas Wind Systems
The cross-platform dbt Mesh beta will include support for Athena, Databricks, Redshift, and Snowflake. As we move toward GA, we plan to add support for any Iceberg-compatible platform.
In the example below, you can see that an Athena upstream can be referenced by a Redshift downstream.
No data is copied or duplicated as part of this process. Everything just works. Your lineage will render in dbt Explorer, and dbt builds will immediately pick up any changes to upstream models computed in a different warehouse. The benefits you get from dbt Cloud and dbt Mesh translate to this cross-platform mesh: it unlocks governance and re-use at scale.
Zooming in to the model level, it works like this:
- In the upstream and downstream warehouses, integrate both warehouses with the same Iceberg catalog.
- In the upstream project, classify your model as public via dbt Mesh, and configure it to be written into your Iceberg catalog.
- In the downstream project, `ref` the upstream model.
- Under the hood, dbt Cloud translates the `ref` to point to the right place in the Iceberg catalog.
- When dbt Cloud executes the upstream, it writes data into the Iceberg catalog.
- When dbt Cloud executes the downstream, it looks up the table in the catalog, and then loads data directly from the Iceberg store.

The result: you can build a model in an upstream project, `ref` it from a model in a downstream project, and the newly built model will be immediately available for consumption by the downstream model.
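Putting those steps together, here is a minimal sketch of both sides (hypothetical names; the exact Iceberg config keys vary by adapter):

```sql
-- Upstream project (e.g. on Athena): a public model written to the
-- shared Iceberg catalog.
-- models/fct_orders.sql
{{ config(
    materialized='table',
    table_format='iceberg',
    access='public'
) }}
select order_id, ordered_at, amount
from {{ ref('stg_orders') }}

-- Downstream project (e.g. on Redshift): an ordinary cross-project
-- ref, which dbt Cloud resolves to the staged Iceberg table.
-- models/monthly_revenue.sql
select date_trunc('month', ordered_at) as order_month,
       sum(amount) as revenue
from {{ ref('orders_platform', 'fct_orders') }}
group by 1
```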
What's next for cross-platform dbt Mesh?
We're actively iterating towards a beta of cross-platform mesh with a few select design partners. We're focused on building complete Iceberg compatibility, supporting a broad set of platforms, and sanding off the rough edges of this new capability. If your company uses more than one of Athena, Redshift, Databricks, or Snowflake, and you are interested in learning more about how to use this, contact your account team—we'll keep you informed when we go into beta in the coming months.
Last modified on: Oct 08, 2024