Frequently asked questions about data mesh
Oct 02, 2023
IDC predicts the world will have 175 zettabytes of data by 2025. Older, monolithic approaches to data management are already struggling to maintain high-quality, secure, and compliant data at this scale.
Data mesh architecture is a new approach to data management that combines data decentralization with federated computational governance. Done well, it improves data quality, reduces time to market, and saves money. But it can be difficult for everyone to wrap their heads around the changes it requires.
We’ve addressed the core concepts behind data mesh, how data architecture has evolved toward data mesh, the four principles behind data mesh, and the components of a data mesh architecture. This article addresses some commonly asked questions we’ve heard about data mesh from our customers.
[Join us at Coalesce October 16-19 to learn from data leaders about how they built scalable data mesh architectures at their organizations.]
What kind of organizations benefit from data mesh?
While data mesh has many benefits, not every company necessarily needs to make the leap tomorrow.
Many companies can improve their approach to data management by making stepwise improvements to their current data architecture. For example, companies that lack easy data discovery, classification, and quality metrics could benefit by introducing components such as a data catalog.
Companies that benefit from data mesh tend to have hit an upper limit with what they can manage using a simple data warehouse or data lake. They have usually hit one of the core problems data mesh was meant to solve:
- Their centralized data engineering team has become a bottleneck that prevents new projects from launching quickly; or
- Downstream errors in data pipelines and reports have spiked due to a lack of data product-oriented thinking.
Data mesh is a good way for larger organizations with multiple data teams to address challenges with velocity and data quality.
What are the prerequisites for moving toward a data mesh architecture?
Companies should have a general (though not necessarily perfect) understanding of which data domains appropriately belong with which teams. These domains may be usage-oriented (e.g., raw vs. consumable data) or business-oriented (e.g., marketing, advertising, accounting, etc.).
A move to data mesh also requires buy-in from all participants. This includes C-suite sponsorship and, critically, buy-in from data engineers, analytics engineers, data domain teams, business analysts, analytics users, and product managers. If employees aren’t on board with the new plan, it could engender frustration and encourage further use of “shadow IT.”
Finally, the data mesh shift requires a solid training program. All stakeholders should understand what the shift entails and receive proper training on the new tools and processes. In particular, there should be training for data domain teams on what data ownership entails and how to manage data pipelines with the new toolset.
What are the technical components of a data mesh?
Our article on the components of a data mesh architecture touches upon most of the architectural elements of a data mesh. Many are tools and technologies you already use (e.g., object storage, data warehouse, data lakes). Others are new and evolving technologies that support critical data mesh principles, such as treating data as a product.
Here are a few technologies that play a notable role in enabling a data mesh architecture:
Data storage: All of the technologies you use for structured and unstructured data storage, including object storage, relational databases, NoSQL data stores, data warehouses, and data lakes.
Provisioning and reservation system: A self-service management layer built by a centralized team that provisions stacks to support data pipelines for domain owners. Usually, this is an Infrastructure as Code (IaC)-driven provisioning system hosted on a cloud provider.
Data ingestion and transformation: Tools like dbt included as part of data pipeline stacks to build, validate, test, and run data pipelines.
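For instance, column-level validations in dbt are declared alongside the model in a schema file. Here is a minimal sketch (the model and column names are hypothetical):

```yaml
# models/staging/schema.yml -- hypothetical model and column names
version: 2

models:
  - name: stg_orders
    description: "One row per order, cleaned from the raw source"
    columns:
      - name: order_id
        tests:
          - unique        # no duplicate orders
          - not_null      # every row must have an ID
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` then fails the pipeline if any of these checks do not hold, which gives domain teams an automated quality gate.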
Data orchestration: Tools that define when data pipelines run and in what order, whether on a fixed schedule or in response to conditions in the data itself.
Data contract validation: A set of tools for defining data interfaces, contracts, and versions, and validating that data products conform to these specifications. One example is dbt’s model contracts feature.
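As a sketch of what this looks like in practice, dbt model contracts (available in dbt 1.5 and later) pin a model's schema in its properties file; the model name and columns below are hypothetical:

```yaml
# models/marts/schema.yml -- hypothetical model; requires dbt 1.5+
version: 2

models:
  - name: dim_customers
    config:
      contract:
        enforced: true     # build fails if the model's output drifts from this spec
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: lifetime_value
        data_type: float
```

With the contract enforced, a change that renames a column or alters its type fails at build time rather than silently breaking downstream consumers.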
Data catalog: The single source of truth for all data sources and data products within a company. Data catalogs enable discovering data, managing data ownership, and tracking the flow of data through the company using data lineage.
Some companies already have a data catalog before migrating to a data mesh architecture. However, the catalog plays an increasingly important role in a data mesh, enabling discoverability across a distributed network of heterogeneous data domain owners.
Data governance software: Software that implements computational (automated) enforcement of data governance policies, such as data classification to identify sensitive data, data quality rules, and data access roles. This software may be part of the data catalog or a separate platform.
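Governance platforms typically express such rules as declarative policy-as-code. The schema below is purely illustrative, not tied to any specific product, but it shows the shape of computational enforcement:

```yaml
# governance/policies.yml -- illustrative format, not a specific product's syntax
policies:
  - name: mask-pii-for-non-privileged-roles
    applies_to:
      classification: pii            # columns tagged as PII by automated classification
    enforcement:
      action: mask
      exempt_roles: [privacy_officer, data_steward]

  - name: quality-gate-consumable-data
    applies_to:
      domain: "*"                    # every data domain
      layer: consumable
    enforcement:
      action: block_publish          # product cannot publish until checks pass
      required_checks: [freshness, null_rate, schema_contract]
```

The key point is that the policy is defined once, centrally, and applied automatically to every domain's data products rather than reviewed by hand.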
Self-service reporting tools: BI software that enables teams who have found data and data products via the data catalog to run their own reporting.
Alerting, monitoring, and reporting: Tools that enable setting alerts on data, such as notifications to downstream teams when a data product changes, help teams maintain data quality over time. Monitoring and reporting show (among other things) who’s using the data catalog, what data is being used (and which isn’t), and the status of security and compliance across the entire organization.
Can existing platforms and tools be integrated into a data mesh?
Yes! As you can see from the technology list above, data mesh leverages many of the basic data storage and data pipeline tools you’ve used for years.
The difference with data mesh is in who has access to the tools and how access is federated across domains. For example, in a more monolithic approach to data management, data pipeline tools might be under the exclusive control of a centralized data engineering team. In a data mesh architecture, data domain teams can operate their own pipelines independently while also integrating laterally with other teams.
What are the economic benefits of data mesh?
It takes time and resources to build out a data mesh architecture. But most companies find that the effort pays for itself. The savings come from multiple areas:
- A data mesh architecture allows business owners to take control of their own data. That reduces friction between business units and IT, enabling teams to deliver higher-quality data products to market in less time.
- Data catalog and data quality tools help teams find high-quality data more easily. This reduces time spent chasing down up-to-date data sets and verifying their accuracy.
- Federated computational governance automates much of the data governance process, ensuring high-quality and compliant data with less manual effort.
- With a holistic view of all data within the company, the data engineering and governance teams can eliminate redundant data and processes. That means reduced spend on data processing.
For example, a Fortune 500 oil & gas company that uses dbt for data transformation moved to a self-service, distributed data development architecture to scale out its data operations. The company cut the time it spent on regulatory reporting by three weeks. By democratizing its data tools, it also doubled the number of people working on data modeling projects.
The result: it saved $10 million that it drove back into the business.
How do you achieve good data governance in a data mesh architecture? Particularly, how do you handle sensitive data?
Many people who first hear about data mesh worry about how domain-driven data ownership and a self-serve data platform work with data governance.
How do you prevent unauthorized personnel from seeing personally identifiable information (PII)? How do you respond to a request to delete customer data when a customer’s data may be spread over dozens of teams’ data products?
In other words, how do you prevent your data mesh from becoming a data anarchy?
This is why the data mesh principle of federated computational governance is so important. New data products must onboard to a centralized data catalog. Once registered, data governance automation can ensure the owning team has applied proper access control, classification, and quality controls to its data.
A data product that isn’t discoverable and governed is just a data silo.
What is the relationship between data mesh and DataOps?
Tristan Handy, the founder of dbt Labs, has talked a lot about the need to bring agile software development practices to data projects.
The software engineering world has made great strides in driving higher quality and faster ship times with concepts such as service-oriented architecture, two-pizza teams, and API contracts. Meanwhile, data has remained stuck in a monolithic rut.
DataOps is a collaborative data framework that brings agile development methodology to data projects. While it’s a separate concept from data mesh, it fits very well within a data mesh architecture:
- DataOps emphasizes “data product” thinking, one of the four principles of data mesh.
- DataOps encourages leveraging automation to improve quality and speed of delivery. Automation in the form of a self-serve data platform and federated computational governance is also a core component of data mesh.
- As Gartner notes, DataOps is fundamentally a collaborative data management practice. Data mesh encourages collaboration through data discovery, shared standards and frameworks, and data interconnectivity regulated by interfaces and contracts.
What are the major challenges in adopting data mesh?
Finally, what challenges should you expect to face along the way?
From a business standpoint, the biggest challenge is that data mesh requires a cultural shift. Some data domain teams may need convincing that owning their own data is the best path forward. Other teams might argue over who should own the canonical version of certain data sets. Data engineering teams may perceive the move to data domain ownership and self-service as “losing control” and push back.
Technically, the first challenge is ensuring a significant portion of the company’s data is available in a connectable format. This might involve using a centralized data warehouse or a connected network of cross-queryable data warehousing stores. Developing a self-service layer and setting up tools for use across dozens or hundreds of teams also requires a significant investment in time and personnel.
The lack of a strong data governance framework can undermine a data mesh project before it begins. Without standards and processes in place for securing data and ensuring compliance, a move from a monolithic to a distributed data architecture can make both security and compliance more difficult.
None of these are insurmountable issues. The keys to resolving them are hosting open discussions and providing a clear definition of the business value and return on investment that you expect the shift to data mesh will provide. Having tools that support more effective collaboration and discussion is critical.
It takes time for everyone involved in data management to understand how data mesh will change how they work. This article addresses some of the most common fears and concerns we’ve heard in the field.
You will, of course, likely run into your own particular challenges as you start your data mesh journey. Remember to remain receptive to feedback and confirm that your proposed solutions will meet the needs of your diverse set of stakeholders. Keeping everyone involved and engaged increases the likelihood that your journey to data mesh will be both successful and rewarding.
Last modified on: Oct 16, 2024
Set your organization up for success. Read the business case guide to accelerate time to value with dbt Cloud.