Data mesh is defined by four principles: domain-driven data ownership, data as a product, the self-serve data platform, and federated computational governance. These principles can be hard to grasp at first, especially since they interact and overlap. This article provides a high-level overview of each principle and how they interconnect.
What is data mesh?
Data mesh is a decentralized data architecture approach that treats data as a product and emphasizes domain-oriented ownership. In contrast to traditional data architectures, where data is often centralized in a monolithic data warehouse or data lake, data mesh advocates for distributing data ownership and responsibility across different business domains within an organization.
In a data mesh framework, each domain (such as marketing, sales, finance, etc.) is responsible for its own data—collecting, managing, and serving it in a way that makes it easily consumable by others within the organization. This approach mirrors the decentralization seen in microservices architectures, where autonomy is given to teams closest to the data source and consumers.
What challenges does data mesh solve?
Traditional centralized data architectures often lead to several challenges as organizations grow.
Bottlenecks
One of the primary issues is the creation of bottlenecks. Centralized data teams can quickly become overwhelmed with requests from various departments, leading to delays in data access and processing. This bottleneck not only slows down the organization’s ability to gain insights but also stifles innovation as teams wait for their data needs to be met.
Data mesh directly addresses these challenges by decentralizing data ownership. Instead of relying on a central team, each domain manages its own data, reducing the load on central resources and speeding up data availability.
Data silos
Additionally, data mesh helps eliminate data silos. In traditional models, data is often isolated within specific departments, making cross-departmental analysis difficult. By promoting a collaborative approach to data management, data mesh ensures that data is easily accessible across the organization, fostering a more integrated and holistic view of business operations.
Scalability
Scalability is another significant challenge that data mesh addresses. As organizations grow, the volume and variety of data can quickly overwhelm centralized systems. Data mesh’s decentralized approach allows each domain to scale its data infrastructure independently, ensuring that growth in data volume does not compromise performance or efficiency.
Data quality
Finally, data quality improves under a data mesh architecture by making each domain responsible for the data it produces. In centralized systems, maintaining data quality across all domains can be a challenge, leading to inconsistent and unreliable data. By contrast, data mesh ensures that those closest to the data are accountable for its accuracy, leading to higher overall data quality across the organization.
Principles of data mesh
Zhamak Dehghani, the originator of the data mesh concept, laid out its four principles during her time at Thoughtworks.
These principles are key because data mesh requires more than just re-architecting your data-driven applications. It involves a mindset shift in how organizations manage data. It also takes diligence to achieve the correct balance between agility and effective oversight.
You can use these four principles to guide your organization through its own data mesh journey. Let’s look at each one in detail.
Principle 1: Domain-driven data ownership
The foundational principle of data ownership is that individual business domain teams should own their own data. This idea is built on Eric Evans’ work around domain-driven design. The aim of domain-driven data ownership is to align responsibility with the business (marketing, finance, etc.), and not with the technology (data warehouses, data lakes).
This principle of data mesh is critical and is also often the hardest to come to terms with. It requires a fundamental shift in how organizations approach data. Principally, it requires a new data architecture to support it: a self-serve data platform operating on a federated governance model. (We’ll dive into those principles further below.)
However, when done correctly, domain-driven ownership brings a number of benefits to an organization compared to the data monoliths of old:
- Clear ownership and demarcation. One of the persistent problems with existing data architectures is determining who owns a given data set. With domain-driven ownership, teams own their own data and register their ownership with a data catalog. In our example above, this means finance would own financial data, sales would own sales data, and so on.
- Scalability. In data monolith architectures, a single data engineering team must own and manage the data and associated business logic for every team across the organization. That creates bottlenecks. With domain-driven ownership, individual teams own their own data and pipelines, which distributes the burden.
- Increased data quality. Centralized data engineering teams don’t always have all the information they need to make the best decisions about what constitutes “good” data. Yielding these decisions back to the data domain owners results in better decision-making and better data quality across the organization.
- Faster time to market. Data domain teams are more knowledgeable about their own data and own their own data pipelines and business logic. This means, on average, that they can deliver new solutions in less time than if they had to hand the full implementation over to a centralized data engineering team.
Principle 2: Data as a product
No team should live in a silo. Teams need to share data with one another constantly, and this is where the mesh portion of data mesh comes in. Each team defines not just the data it owns, but also the data it produces for and consumes from other teams. These shared, well-defined units of data are called data products.
A data product is a well-defined, self-contained unit of data that solves a business problem. A data product can be simple (a table or a report) or complex (an entire machine learning model).
How do data products achieve this?
- Interfaces that define how the team exposes data to others: e.g. the columns, data types, and constraints that compose a given data product
- Contracts that serve as written specifications for interfaces and that all teams can use to validate their conformance to the interface
- Versions to define new revisions of a contract while supporting previous versions for backward compatibility
- Access rules that define who can access what data in a data product. For example, the product should mask sensitive data fields—such as personally identifiable information (PII)—from personnel who do not have a clear business reason to see it.
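The interface and contract ideas above can be sketched in code. Below is a minimal, illustrative sketch in which a contract is expressed as a mapping from column names to expected types and nullability; the contract format and column names are invented for illustration, and real teams typically use purpose-built tools such as JSON Schema, Avro schemas, or dbt model contracts instead.

```python
# A toy sketch of contract validation for a data product.
# The contract format and column names here are hypothetical.

CONTRACT = {
    # column name: (expected Python type, nullable?)
    "order_id": (int, False),
    "amount": (float, False),
    "coupon_code": (str, True),
}

def validate_row(row: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for column, (expected_type, nullable) in contract.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif row[column] is None:
            if not nullable:
                errors.append(f"null in non-nullable column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(f"wrong type for {column}: {type(row[column]).__name__}")
    return errors
```

A conforming row such as `{"order_id": 1, "amount": 9.5, "coupon_code": None}` validates cleanly, while a null `amount` produces a violation the producing team can catch before publishing.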
Principle 3: Self-serve data platform
Domain data teams manage their own data products from end to end. This includes all associated ingestion, data transformation, data quality testing, and analytics. But as mentioned above, it doesn’t make sense for every domain data team to stand up this toolset on their own. Most won’t have the time or capability. At best, this results in redundant contracts (e.g., multiple teams licensing multiple data storage solutions from different vendors) and incompatible tooling.
A more efficient way to manage this is by creating a self-serve data platform. This platform consists of the tools that data domain teams need to ingest, transform, store, clean, test, and analyze data. The self-serve data platform also provides other capabilities, such as:
- Security controls for managing access to data
- A data catalog for registering, finding, and managing data products across the company
- An orchestration platform for governing access to data products and provisioning them
Without this self-serve platform, many teams will lack the tools required to join the data mesh. By providing these tools, the data platform team unlocks the scalability of a data mesh architecture.
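To make the catalog component above concrete, here is a toy, in-memory sketch of registering and discovering data products; the class and field names are hypothetical and stand in for a real catalog product, which would add search, lineage, and access controls.

```python
# A toy, in-memory data catalog; names are invented for illustration.
# Real self-serve platforms use dedicated catalog products instead.

from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str          # owning business domain, e.g. "finance"
    owner: str           # accountable team or person
    tags: set[str] = field(default_factory=set)

class DataCatalog:
    def __init__(self):
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        """Domain teams register their products so others can find them."""
        self._products[product.name] = product

    def find_by_domain(self, domain: str) -> list[DataProduct]:
        """Consumers discover products by the domain that owns them."""
        return [p for p in self._products.values() if p.domain == domain]

catalog = DataCatalog()
catalog.register(DataProduct("fct_orders", domain="sales", owner="sales-data"))
catalog.register(DataProduct("dim_accounts", domain="finance", owner="fin-data"))
```

The key design point is that registration is done by the owning domain team, while discovery is open to the whole organization.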
Principle 4: Federated computational governance
A data mesh can become a data anarchy without proper governance controls. That’s why the final—and perhaps most important—principle of data mesh is federated computational governance.
In a data mesh architecture, while domain teams own their data products, the data platform and the corporate data governance team track and manage compliance centrally via a data catalog and data governance tools.
Federated computational governance enables data governance at scale. Automated policy enforcement reduces the manual labor required to remain compliant with the growing, complex body of data regulations worldwide. And when there are issues, the data domain team—the data owners—are then responsible for responding to and fixing any compliance issues with their data, such as classifying unclassified values or removing sensitive information from access logs.
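As a sketch of what the “computational” part of governance can mean in practice, the checks below run automatically against catalog entries; the policy rules and field names are illustrative assumptions, not a standard.

```python
# Hypothetical automated governance checks run against catalog entries.
# Each product is a plain dict here; field names are invented.

def governance_violations(product: dict) -> list[str]:
    """Flag products that fail baseline, centrally defined policies."""
    violations = []
    if not product.get("owner"):
        violations.append("no accountable owner assigned")
    if product.get("classification") is None:
        violations.append("data classification missing")
    if product.get("contains_pii") and not product.get("access_policy"):
        violations.append("PII present but no access policy defined")
    return violations

products = [
    {"name": "fct_orders", "owner": "sales-data",
     "classification": "internal", "contains_pii": False},
    {"name": "dim_customers", "owner": "crm-data",
     "classification": None, "contains_pii": True},
]

# The platform runs the checks; the owning domain team fixes the findings.
report = {p["name"]: governance_violations(p) for p in products}
```

Violations are surfaced back to the data domain team, which, as the data owner, is responsible for remediating them.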
Best practices when implementing the principles of data mesh
Knowing the principles is one thing; implementing them is another. Below are best practices for implementing each one.
Principle 1: Domain-driven data ownership
It doesn’t make sense for every domain data team to reinvent the wheel. To support domain-driven data, organizations should task a data platform team to create the necessary infrastructure that enables domain teams to manage their own data.
A data mesh enablement architecture offers shared, centrally managed services for data management, including storage, orchestration, ingestion, transformation, cataloging and classification, and monitoring and alerting.
On the data domain side, teams need to define their own data contexts and data products (which we’ll discuss more below). They may also want to have embedded data engineers and analytics engineers to support managing their own data pipelines and reports.
Principle 2: Data as a product
To implement data as a product, data platform teams need to support tools for defining, validating, and versioning contracts.
Treating data as a product also means managing it as a product. Data domain teams should work to understand and document their existing data workflows, as well as create backlogs to track and manage upcoming releases.
An important success factor in data mesh is a data enablement team. The enablement team assists domain teams in making the shift to data as a product by defining modeling best practices, designing reference examples, and training users on tools and processes. Organizationally, the data enablement team is often a part of the data platform team.
Principle 3: Self-serve data platform
The self-serve data platform is where the data mesh moves from theory to reality. So it’s critical that the data platform team and data domain teams work closely together to create a toolset that works for all stakeholders.
The most successful tool sets will be those that leverage technology that your engineers and analysts already know. For example, using a tool like dbt for data transformation requires less ramp-up time, as it leverages a language—SQL—that most engineers and analysts already know and use daily.
When introducing a major change such as a self-serve data platform, it’s best to start small. The data platform team should gather requirements as broadly as possible—but implementation should start with a single data domain team. After onboarding a single team and incorporating their feedback, the data platform and enablement teams can onboard the next team and then the next. Along the way, the platform and domain teams iterate over the toolset and processes, while the enablement team expands its training and repertoire of best practices.
Principle 4: Federated computational governance
After establishing firm data governance policies, the data governance and data platform teams invest in tools that support federated computational governance. Many data catalogs offer robust data governance tools out of the box or as add-ons purchased separately. Data platforms and data transformation tools also provide governance features like role-based access control, testing, and model governance.
It’s the data platform’s job to convert data governance policies into automated governance. This includes setting appropriate access controls, enforcing classification rules, establishing rules for data quality, and configuring anomaly detection, among others.
The data governance team, which is itself composed of domain experts, works with the enablement and data domain teams to educate everyone on data governance best practices, including the domain teams’ new responsibilities as data owners.
Frequently asked questions
What kind of organizations benefit from data mesh?
While data mesh has many benefits, not every company necessarily needs to make the leap tomorrow.
Many companies can improve their approach to data management by making stepwise improvements to their current data architecture. For example, companies that lack easy data discovery, classification, and quality metrics could benefit by introducing components such as a data catalog.
Companies that benefit from data mesh tend to have hit an upper limit with what they can manage using a simple data warehouse or data lake. They have usually hit one of the core problems data mesh was meant to solve:
- Their centralized data engineering team is a bottleneck that prevents new projects from launching quickly
- Downstream errors in data pipelines and reports have spiked due to a lack of data product-oriented thinking

In short, data mesh is a good way for larger organizations with multiple data teams to address challenges with velocity and data quality.
What are the prerequisites for moving toward a data mesh architecture?
Companies should have a general (though not necessarily perfect) understanding of which data domains appropriately belong with which teams. These domains may be usage-oriented (e.g., raw vs. consumable data) or business-oriented (e.g., marketing, advertising, accounting, etc.).
A move to data mesh also requires buy-in from all participants. This includes C-suite sponsorship and, critically, buy-in from data engineers, analytics engineers, data domain teams, business analysts, analytics users, and product managers. If employees aren’t on board with the new plan, it could engender frustration and encourage further use of “shadow IT.”
Finally, the data mesh shift requires a solid training program. All stakeholders should understand what the shift entails and receive proper training on the new tools and processes. In particular, there should be training for data domain teams on what data ownership entails and how to manage data pipelines with the new toolset.
What are the technical components of a data mesh?
Our article on the components of a data mesh architecture touches upon most of the architectural elements of a data mesh. Many are tools and technologies you already use (e.g., object storage, data warehouse, data lakes). Others are new and evolving technologies that support critical data mesh principles, such as treating data as a product.
Here are a few technologies that play a notable role in enabling a data mesh architecture:
- Data storage: All of the technologies you use for structured and unstructured data storage, including object storage, relational databases, NoSQL data stores, data warehouses, and data lakes.
- Provisioning and reservation system: A self-service management layer built by a centralized team that provisions stacks to support data pipelines for domain owners. Usually, this is an Infrastructure as Code (IaC)-driven provisioning system hosted on a cloud provider.
- Data ingestion and transformation: Tools like dbt included as part of data pipeline stacks to build, validate, test, and run data pipelines.
- Data orchestration: Tools that define when pipelines run and which datasets they use, whether on a schedule or based on conditions in the data itself.
- Data contract validation: A set of tools for defining data interfaces, contracts, and versions, and validating that data products conform to these specifications. One example is dbt’s model contracts feature.
- Data catalog: The single source of truth for all data sources and data products within a company. Data catalogs enable discovering data, managing data ownership, and tracking the flow of data through the company using data lineage. Some companies already have data catalogs before migrating to a data mesh architecture. However, the catalog plays an increasingly important role in a data mesh, enabling discoverability across a distributed network of heterogeneous data domain owners.
- Data governance software: Software that implements computational (automated) enforcement of data governance policies, such as data classification to identify sensitive data, data quality rules, and data access roles. This software may be part of the data catalog or a separate platform.
- Self-service reporting tools: BI software that enables teams who have found data and data products via the data catalog to run their own reporting.
- Alerting, monitoring, and reporting: Tools that let teams set alerts on data, such as notifications to downstream teams when a data product changes, help maintain data quality over time. Monitoring and reporting show (among other things) who’s using the data catalog, which data is being used (and which isn’t), and the status of security and compliance across the entire organization.
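The contract-validation entry above mentions dbt’s model contracts. A minimal sketch of what one looks like in a dbt `schema.yml` file (dbt 1.5 and later; the model and column names are illustrative):

```yaml
# models/schema.yml — a minimal dbt model contract; names are illustrative
version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true   # dbt fails the build if the model's output drifts
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: order_total
        data_type: numeric
```

With `enforced: true`, dbt verifies at build time that the model produces exactly these columns with these data types, turning the contract from documentation into an automated check.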
Can existing platforms and tools be integrated into a data mesh?
Yes! As you can see from the technology list above, data mesh leverages many of the basic data storage and data pipeline tools you’ve used for years.
The difference with data mesh is in who has access to the tools and how access is federated across domains. For example, in a more monolithic approach to data management, data pipeline tools might be under the exclusive control of a centralized data engineering team. In a data mesh architecture, data domain teams can operate their own pipelines independently while also integrating laterally with other teams.
What are the economic benefits of data mesh?
It takes time and resources to build out a data mesh architecture. But most companies find that the effort pays for itself. The savings come from multiple areas:
- A data mesh architecture allows business owners to take control of their own data. That reduces friction between business units and IT, enabling teams to deliver higher-quality data products to market in less time.
- Data catalog and data quality tools help teams find high-quality data more easily. This reduces time spent chasing down up-to-date data sets and ascertaining the veracity of data.
- Federated computational governance automates much of the data governance process, ensuring high-quality and compliant data with less manual effort.
- With a holistic view of all data within the company, the data engineering and governance teams can eliminate redundant data and processes. That means reduced spend on data processing.
For example, a Fortune 500 oil and gas company that uses dbt for data transformation moved to a self-service, distributed data development architecture to scale out its data operations. The company cut three weeks from the time it spends on regulatory reporting. By democratizing its data tools, it also doubled the number of people working on data modeling projects.
The result: it saved $10 million that it drove back into the business.
How do you achieve good data governance in a data mesh architecture? Particularly, how do you handle sensitive data?
Many people who first hear about data mesh worry about how domain-driven data ownership and a self-serve data platform work with data governance.
How do you prevent unauthorized personnel from seeing personally identifiable information (PII)? How do you respond to a request to delete customer data when a customer’s data may be spread over dozens of teams’ data products?
In other words, how do you prevent your data mesh from becoming a data anarchy?
This is why the data mesh principle of federated computational governance is so important. New data products must onboard to a centralized data catalog. Once registered, data governance automation can ensure the owning team has applied proper access control, classification, and quality controls to its data.
A data product that isn’t discoverable and governed is just a data silo.
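One common answer to the PII question above is automated masking at the data-access layer. A toy sketch, with invented field names; real platforms enforce this with column-level security in the warehouse rather than application code.

```python
# Toy PII masking at the access layer; field names are hypothetical.

PII_FIELDS = {"email", "phone", "full_name"}

def mask_record(record: dict, viewer_has_pii_access: bool) -> dict:
    """Return the record with PII columns redacted for unauthorized viewers."""
    if viewer_has_pii_access:
        return dict(record)
    return {
        key: ("***" if key in PII_FIELDS else value)
        for key, value in record.items()
    }

row = {"order_id": 42, "email": "pat@example.com", "amount": 19.99}
```

An unauthorized viewer calling `mask_record(row, viewer_has_pii_access=False)` sees `"***"` in place of the email while non-sensitive fields pass through unchanged.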
What is the relationship between data mesh and DataOps?
Tristan Handy, the founder of dbt Labs, has talked a lot about the need to bring agile software development practices to data projects.
The software engineering world has made great strides in driving higher quality and faster ship times with concepts such as services-oriented architecture, two-pizza teams, and API contracts. Meanwhile, data has remained stuck in a monolithic rut.
DataOps is a collaborative data framework that brings agile development methodology to data projects. While it’s a separate concept from data mesh, it fits very well within a data mesh architecture:
- DataOps emphasizes “data product” thinking—one of the four principles of data mesh.
- DataOps encourages leveraging automation to improve quality and speed of delivery. Automation, in the form of a self-serve data platform and federated computational governance, is also a core component of data mesh.
As Gartner notes, DataOps is fundamentally a collaborative data management practice. Data mesh encourages collaboration through data discovery, shared standards and frameworks, and data interconnectivity regulated by interfaces and contracts.
What are the major challenges in adopting data mesh?
From a business standpoint, the biggest challenge is that data mesh requires a cultural shift. Some data domain teams may need convincing that owning their own data is the best path forward. Other teams might argue over who should own the canonical version of certain data sets. Data engineering teams may perceive the move to data domain ownership and self-service as “losing control” and push back.
Technically, the first challenge is ensuring a significant portion of the company’s data is available in a connectable format. This might involve using a centralized data warehouse or a connected network of cross-queryable data warehousing stores. Developing a self-service layer and setting up tools for use across dozens or hundreds of teams also requires a significant investment in time and personnel.
The lack of a strong data governance framework can undermine a data mesh project before it begins. Without standards and processes in place for securing data and ensuring compliance, a move from a monolithic to a distributed data architecture can make both security and compliance more difficult.
None of these are insurmountable issues. The keys to resolving them are hosting open discussions and clearly defining the business value and return on investment you expect the shift to data mesh to deliver. Tools that support more effective collaboration and discussion are also critical.
You will, of course, likely run into your own particular challenges as you start your data mesh journey. Remember to remain receptive to feedback and confirm that your proposed solutions will meet the needs of your diverse set of stakeholders. Keeping everyone involved and engaged increases the likelihood that your journey to data mesh will be both successful and rewarding.
Getting started with dbt Mesh
As organizations adopt data mesh architectures, the need for tools that support decentralized data management becomes increasingly important. dbt Mesh, a powerful extension of dbt Cloud, is designed to help organizations manage and collaborate on data transformations across multiple domains in a data mesh architecture.
Download our comprehensive guide to data mesh to gain a rich understanding of everything you need to begin your own data mesh journey. In this guide, you’ll learn:
- Problem-solving with data mesh: Learn about the challenges that data mesh solves, including data bottlenecks, inefficiency, and the loss of context in traditional centralized systems.
- The core principles of data mesh: Dive deep into the foundational elements of data mesh. Understand how domain-oriented decentralized ownership, data as a product, self-serve data infrastructure, and federated computational governance can transform your data ecosystem.
- Crafting your data mesh: Apply data mesh principles in practice to build a mesh architecture tailored to your organization’s unique needs and goals.
Last modified on: Dec 16, 2024