Data mesh is defined by four principles: domain-driven data ownership, data as a product, a self-serve data platform, and federated computational governance. These principles can be hard to grasp at first, especially since they interact and overlap. This article provides a high-level overview of each principle and how they interconnect.
What is data mesh?
Data mesh is a decentralized data architecture approach that treats data as a product and emphasizes domain-oriented ownership. In contrast to traditional data architectures, where data is often centralized in a monolithic data warehouse or data lake, data mesh advocates for distributing data ownership and responsibility across different business domains within an organization.
In a data mesh framework, each domain (such as marketing, sales, finance, etc.) is responsible for its own data—collecting, managing, and serving it in a way that makes it easily consumable by others within the organization. This approach mirrors the decentralization seen in microservices architectures, where autonomy is given to teams closest to the data source and consumers.
What challenges does data mesh solve?
Traditional centralized data architectures often lead to several challenges as organizations grow.
Bottlenecks
One of the primary issues is the creation of bottlenecks. Centralized data teams can quickly become overwhelmed with requests from various departments, leading to delays in data access and processing. This bottleneck not only slows down the organization’s ability to gain insights but also stifles innovation as teams wait for their data needs to be met.
Data mesh directly addresses this bottleneck by decentralizing data ownership. Instead of relying on a central team, each domain manages its own data, reducing the load on central resources and speeding up data availability.
Data silos
Additionally, data mesh helps eliminate data silos. In traditional models, data is often isolated within specific departments, making cross-departmental analysis difficult. By promoting a collaborative approach to data management, data mesh ensures that data is easily accessible across the organization, fostering a more integrated and holistic view of business operations.
Scalability
Scalability is another significant challenge that data mesh addresses. As organizations grow, the volume and variety of data can quickly overwhelm centralized systems. Data mesh’s decentralized approach allows each domain to scale its data infrastructure independently, ensuring that growth in data volume does not compromise performance or efficiency.
Data quality
Finally, a data mesh architecture improves data quality by making each domain responsible for the data it produces. In centralized systems, maintaining data quality across all domains can be a challenge, leading to inconsistent and unreliable data. By contrast, data mesh makes those closest to the data accountable for its accuracy, leading to higher overall data quality across the organization.
Principles of data mesh
Zhamak Dehghani, the progenitor of the data mesh architecture, laid out the four principles of data mesh during her time at Thoughtworks.
These principles are key because data mesh requires more than just re-architecting your data-driven applications. It involves a mindset shift in how organizations manage data. It also takes diligence to achieve the correct balance between agility and effective oversight.
You can use these four principles in guiding your organization through its own data mesh journey. Let’s look at each one in detail.
Principle 1: Domain-driven data ownership
The foundational principle of data mesh is that individual business domain teams should own their own data. This idea is built on Eric Evans’ work around domain-driven design. The aim of domain-driven data ownership is to align responsibility with the business (marketing, finance, etc.), and not with the technology (data warehouses, data lakes).
As an example, consider a fictional e-commerce company. At the highest level, the company breaks down into natural data domains by department and function: e.g., sales, marketing, finance, engineering.
This principle of data mesh is critical and is also often the hardest to come to terms with. It requires a fundamental shift in how organizations approach data. Principally, it requires a new data architecture to support it: a self-serve data platform operating on a federated governance model. (We’ll dive into those principles further below.)
However, when done correctly, domain-driven ownership brings a number of benefits to an organization compared to the data monoliths of old:
- Clear ownership and demarcation. One of the persistent problems with existing data architectures is determining who owns a given data set. With domain-driven ownership, teams own their own data and register their ownership with a data catalog. In our example above, this means finance would own financial data, sales would own sales data, and so on.
- Scalability. In data monolith architectures, a single data engineering team must own and manage the data and associated business logic for every team across the organization. That creates bottlenecks. With domain-driven ownership, individual teams own their own data and pipelines, which distributes the burden.
- Increased data quality. Centralized data engineering teams don’t always have all the information they need to make the best decisions about what constitutes “good” data. Yielding these decisions back to the data domain owners results in better decision-making and better data quality across the organization.
- Faster time to market. Data domain teams are more knowledgeable about their own data and own their own data pipelines and business logic. This means, on average, that they can deliver new solutions in less time than if they had to hand the full implementation over to a centralized data engineering team.
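To make the ownership idea concrete, here is a minimal Python sketch of a catalog that records which domain owns each data set. The `Catalog` class, its method names, and the data set names are hypothetical illustrations, not the API of any real data catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy in-memory data catalog (hypothetical; not a real product's API)."""

    # Maps data set name -> owning domain.
    owners: dict[str, str] = field(default_factory=dict)

    def register(self, dataset: str, domain: str) -> None:
        """Record that `domain` owns `dataset`; reject conflicting claims."""
        current = self.owners.get(dataset)
        if current is not None and current != domain:
            raise ValueError(f"{dataset!r} is already owned by {current!r}")
        self.owners[dataset] = domain

    def owner_of(self, dataset: str) -> str:
        """Answer the question 'who owns this data set?'."""
        return self.owners[dataset]

    def datasets_for(self, domain: str) -> list[str]:
        """List every data set a domain owns, in alphabetical order."""
        return sorted(d for d, o in self.owners.items() if o == domain)


catalog = Catalog()
catalog.register("invoices", "finance")
catalog.register("orders", "sales")
catalog.register("refunds", "finance")
```

Because a second domain cannot claim a data set that already has an owner, the question "who owns this?" always has exactly one answer.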
Principle 2: Data as a product
No team should live in a silo. Teams need to share data with one another constantly. This is where the mesh portion of data mesh comes in. Each team defines not just the data that it owns, but also the data it produces for, and consumes from, other teams. This is where the idea of a data product comes in.
A data product is a well-defined, self-contained unit of data that solves a business problem. A data product can be simple (a table or a report) or complex (an entire machine learning model).
How do data products achieve this? Each one defines four things:
- Interfaces that define how the team exposes data to others: e.g. the columns, data types, and constraints that compose a given data product
- Contracts that serve as written specifications for interfaces and that all teams can use to validate their conformance to the interface
- Versions to define new revisions of a contract while supporting previous versions for backward compatibility
- Access rules that define who can access what data in a data product. For example, the product should mask sensitive data fields—such as personally identifiable information (PII)—from personnel who do not have a clear business reason to see it.
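The four elements above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real contract format: the `Column` and `Contract` classes, the `sales.orders` product, and the `"***"` masking convention are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column of the interface: its name, type, and constraints."""

    name: str
    dtype: type
    nullable: bool = False
    sensitive: bool = False  # e.g. PII that should be masked


@dataclass(frozen=True)
class Contract:
    """A versioned, written specification of a data product's interface."""

    product: str
    version: int
    columns: tuple[Column, ...]

    def violations(self, row: dict) -> list[str]:
        """Return the ways `row` breaks the interface (empty if none)."""
        problems = []
        for col in self.columns:
            value = row.get(col.name)
            if value is None:
                if not col.nullable:
                    problems.append(f"{col.name}: missing or null")
            elif not isinstance(value, col.dtype):
                problems.append(f"{col.name}: expected {col.dtype.__name__}")
        return problems

    def view_for(self, row: dict, can_see_sensitive: bool) -> dict:
        """Apply the access rule: mask sensitive fields for most readers."""
        return {
            col.name: "***" if col.sensitive and not can_see_sensitive
            else row.get(col.name)
            for col in self.columns
        }


orders_v1 = Contract(
    product="sales.orders",
    version=1,
    columns=(
        Column("order_id", int),
        Column("customer_email", str, sensitive=True),
        Column("amount", float),
    ),
)

row = {"order_id": 42, "customer_email": "a@example.com", "amount": 19.99}
print(orders_v1.violations(row))
print(orders_v1.view_for(row, can_see_sensitive=False))
```

Producers can run `violations` before publishing, and consumers can run it on what they receive, so both sides validate against the same written specification.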
Principle 3: Self-serve data platform
Domain data teams manage their own data products end to end. This includes all associated ingestion, data transformation, data quality testing, and analytics. But it doesn’t make sense for every domain data team to stand up this toolset on its own. Most won’t have the time or capability. At best, this results in redundant contracts (e.g., multiple teams licensing multiple data storage solutions from different vendors) and incompatible tooling.
A more efficient way to manage this is by creating a self-serve data platform. This platform consists of the tools that data domain teams need to ingest, transform, store, clean, test, and analyze data. The self-serve data platform will also provide other shared tools, such as:
- Security controls for managing access to data
- A data catalog for registering, finding, and managing data products across the company
- An orchestration platform for governing access to data products and provisioning them
Without this self-serve platform, many teams will lack the tools required to join the data mesh. By providing these tools, the data platform team unlocks the scalability of a data mesh architecture.
Principle 4: Federated computational governance
A data mesh can become a data anarchy without proper governance controls. That’s why the final—and perhaps most important—principle of data mesh is federated computational governance.
In a data mesh architecture, while domain teams own their data products, the data platform and the corporate data governance team track and manage compliance centrally via a data catalog and data governance tools.
Federated computational governance enables data governance at scale. Automated policy enforcement reduces the manual labor required to remain compliant with the growing, complex body of data regulations worldwide. And when there are issues, the data domain team—the data owners—can resolve them quickly.
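As a rough sketch of what automated policy enforcement could look like, the Python below runs a set of organization-wide policies against every data product's catalog metadata. The metadata shape, the two policies, and the product names are assumptions made for illustration, not the behavior of any specific governance tool.

```python
from typing import Callable, Optional

# A data product's catalog metadata, kept as a plain dict for brevity.
Product = dict
# A policy inspects one product and returns a violation message or None.
Policy = Callable[[Product], Optional[str]]


def must_have_owner(product: Product) -> Optional[str]:
    return None if product.get("owner") else "no owning domain registered"


def pii_must_be_masked(product: Product) -> Optional[str]:
    unmasked = [
        c["name"] for c in product.get("columns", [])
        if c.get("pii") and not c.get("masked")
    ]
    return f"unmasked PII columns: {', '.join(unmasked)}" if unmasked else None


# Defined once, centrally; applied to every domain's products.
GLOBAL_POLICIES: list[Policy] = [must_have_owner, pii_must_be_masked]


def audit(products: list[Product]) -> dict[str, list[str]]:
    """Run every global policy against every product.

    Returns {product name: [violations]} for non-compliant products only.
    """
    report = {}
    for product in products:
        found = [msg for policy in GLOBAL_POLICIES if (msg := policy(product))]
        if found:
            report[product["name"]] = found
    return report


products = [
    {"name": "sales.orders", "owner": "sales",
     "columns": [{"name": "email", "pii": True, "masked": True}]},
    {"name": "marketing.leads", "owner": "marketing",
     "columns": [{"name": "phone", "pii": True, "masked": False}]},
    {"name": "legacy.export", "columns": []},
]

for name, issues in audit(products).items():
    print(name, issues)
```

The policies live in one central place, while each violation points at a specific product, so the owning domain team knows exactly what to fix.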
Best practices when implementing the principles of data mesh
Principle 1: Domain-driven data ownership
It doesn’t make sense for every domain data team to reinvent the wheel. To support domain-driven data, organizations should task a data platform team to create the necessary infrastructure that enables domain teams to manage their own data.
A data mesh enablement architecture offers centralized services for data management, including storage, orchestration, ingestion, transformation, cataloging and classification, and monitoring & alerting.
On the data domain side, teams need to define their own data contexts and data products (which we’ll discuss more below). They may also want to have embedded data engineers and analytics engineers to support managing their own data pipelines and reports.
Principle 2: Data as a product
To implement data as a product, data platform teams need to support tools for defining, validating, and versioning contracts.
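One piece of versioning support is a check that a new contract version does not break existing consumers. The Python sketch below assumes a deliberately simple rule: a new version is backward compatible if it keeps every existing column with the same type. Both the rule and the `{column name: type name}` representation are illustrative assumptions, not a standard.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two contract versions given as {column name: type name}.

    Assumed rule (an illustration, not a standard): a new version is
    backward compatible if it keeps every existing column with the same
    type. Adding columns is allowed.
    """
    problems = []
    for column, dtype in old.items():
        if column not in new:
            problems.append(f"removed column {column}")
        elif new[column] != dtype:
            problems.append(f"{column} changed from {dtype} to {new[column]}")
    return problems


v1 = {"order_id": "integer", "amount": "decimal"}
v2 = {"order_id": "integer", "amount": "decimal", "currency": "text"}
v3 = {"order_id": "text", "currency": "text"}

print(breaking_changes(v1, v2))  # additive change, nothing reported
print(breaking_changes(v1, v3))  # retyped and removed columns reported
```

A platform team could run a check like this in continuous integration, so that a breaking change forces a new major version rather than silently replacing the old one.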
Treating data as a product also means managing it as a product. Data domain teams should work to understand and document their existing data workflows, as well as create backlogs to track and manage upcoming releases.
An important success factor in data mesh is a data enablement team. The enablement team assists domain teams in making the shift to data as a product by defining modeling best practices, designing reference examples, and training users on tools and processes. Organizationally, the data enablement team is often a part of the data platform team.
Principle 3: Self-serve data platform
The self-serve data platform is where the data mesh moves from theory to reality. So it’s critical that the data platform team and data domain teams work closely together to create a toolset that works for all stakeholders.
The most successful toolsets will be those that leverage technology that your engineers and analysts already know. For example, using a tool like dbt for data transformation requires less ramp-up time, as it leverages a language (SQL) that most engineers and analysts already know and use daily.
When introducing a major change such as a self-serve data platform, it’s best to start small. The data platform team should gather requirements as broadly as possible—but implementation should start with a single data domain team. After onboarding a single team and incorporating their feedback, the data platform and enablement teams can onboard the next team and then the next. Along the way, the platform and domain teams iterate over the toolset and processes, while the enablement team expands its training and repertoire of best practices.
Principle 4: Federated computational governance
After establishing firm data governance policies, the data governance and data platform teams invest in tools that support federated computational governance. Many data catalogs offer robust data governance tools out of the box or as add-ons purchased separately. Data platforms and data transformation tools also provide governance features like role-based access control, testing, and model governance.
It’s the data platform’s job to convert data governance policies into automated governance. This includes setting appropriate access controls, enforcing classification rules, establishing rules for data quality, and configuring anomaly detection, among others.
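Two of these automated checks, a data quality rule and anomaly detection, can be sketched in Python as follows. The function names, the null-rate rule, and the z-score threshold of 3.0 are hypothetical choices for illustration; a real platform would tune both the rules and the thresholds.

```python
from statistics import mean, stdev


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def passes_quality_rule(rows: list[dict], column: str,
                        max_null_rate: float) -> bool:
    """Data quality rule: the column's null rate must stay under a limit."""
    return null_rate(rows, column) <= max_null_rate


def is_anomalous(history: list[int], today: int,
                 threshold: float = 3.0) -> bool:
    """Flag today's row count if it is more than `threshold` standard
    deviations from the historical mean (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        return today != history[0]
    return abs(today - mean(history)) / spread > threshold


rows = [{"amount": 10.0}, {"amount": None}, {"amount": 12.5}, {"amount": 8.0}]
daily_counts = [1000, 1020, 980, 1010, 990]

print(passes_quality_rule(rows, "amount", max_null_rate=0.3))
print(is_anomalous(daily_counts, today=400))
```

The governance team decides the thresholds as policy; the platform team turns them into checks like these that run on every data product without manual review.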
The data governance team, which is itself composed of domain experts, works with the enablement and data domain teams to educate everyone on data governance best practices, including the domain teams’ new responsibilities as data owners.
Introducing dbt Mesh
As organizations adopt data mesh architectures, the need for tools that support decentralized data management becomes increasingly important. dbt Mesh, a powerful extension of dbt Cloud, is designed to help organizations manage and collaborate on data transformations across multiple domains in a data mesh architecture.
Last modified on: Oct 15, 2024