
Understanding data mesh architecture

Sep 22, 2023


In today's data-driven world, organizations are continually seeking better ways to manage and utilize their vast amounts of data efficiently. Traditional data architectures, with their centralized control and monolithic structures, often struggle to keep pace with growing demands. This is where a data mesh comes into play—an approach that decentralizes data management and empowers individual teams to own their data and its lifecycle.

How data mesh changes data architecture

Data mesh is a new approach to data architecture. Rather than managing all data and data processing as a single monolith, it decomposes data into a series of data domains, each owned by the team closest to that data.

In a monolithic data management approach, technology drives ownership. A single data engineering team typically owns all the data storage, pipelines, testing, and analytics for multiple teams—such as Finance, Sales, etc.

In a data mesh architecture, business function drives ownership. The data engineering team still owns a centralized data platform that offers services such as storage, ingestion, analytics, security, and governance. But teams such as Finance and Sales would each own their data and its full lifecycle (e.g. making code changes and maintaining code in production). Moving to a data mesh architecture brings numerous benefits:

  • It removes roadblocks to innovation by creating a self-service model for teams to create new data products
  • It democratizes data while retaining centralized governance and security controls
  • It decreases data project development cycles, saving money and time that can be driven back into the business

Because it’s evolved from previous approaches to data management, data mesh uses many of the same tools and systems that monolithic approaches use, yet exposes these tools in a self-service model combining agility, team ownership, and organizational oversight.


The challenges of monolithic data management

As noted above, in a monolithic data management approach technology drives ownership: a single data engineering team owns all of the data storage, pipelines, testing, and analytics for every business function. In other words, a single team becomes a single bottleneck for data processing.

The distance between the data processors and the data owners results in long load-test-fix cycles that delay data project delivery times. And the complexity of these systems means few people understand how all the parts work together.

None of these problems are new. But they’ve become more noticeable as data architectures grow more complex in scope and scale.

ETL and its drawbacks

Before cloud systems, data was stored on-premises. Companies stored most of it in large online transaction processing (OLTP) systems built on a symmetric multi-processing (SMP) model, such as Teradata.

Some companies—those who could afford it—stood up data warehouses. Built around an online analytical processing (OLAP) model and massively parallel processing (MPP), data warehouses gave data analysts, business analysts, and managers faster access to data for reporting and analytics.

To move data into these warehouses, IT or data engineering teams used Extract, Transform, and Load (ETL) processes, which cleaned and reshaped the data into a state suitable for general use before loading it.
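To make the pattern concrete, here's a minimal sketch of ETL in Python, using an in-memory SQLite database as a stand-in for an on-prem warehouse. The sample records and cleanup rules are made up for illustration; the key point is that the data is validated and reshaped in application code before anything is loaded:

```python
import sqlite3

# --- Extract: pull raw records from a source system (hard-coded here for illustration)
raw_orders = [
    {"order_id": "1", "amount": "19.99", "country": "us"},
    {"order_id": "2", "amount": "N/A",   "country": "DE"},   # bad record
]

# --- Transform: clean and standardize *before* loading (the defining trait of ETL)
def transform(records):
    cleaned = []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        cleaned.append((int(r["order_id"]), amount, r["country"].upper()))
    return cleaned

# --- Load: write only the fully transformed rows into the warehouse
warehouse = sqlite3.connect(":memory:")  # stand-in for an on-prem warehouse
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", transform(raw_orders))
print(warehouse.execute("SELECT * FROM orders").fetchall())
```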

While this approach worked, it had numerous drawbacks:

  • The expense of building data warehouses on-prem prohibited many companies from leveraging the technology
  • Most workloads had to go through the IT and (later) data engineering teams, which quickly became bottlenecks
  • These centralized solutions produced one-size-fits-all datasets that couldn’t meet the needs of every team’s use case
  • Many ETL pipelines were hard-coded and brittle, resulting in fire drills whenever one went down due to bad data or faulty code
  • Some teams, frustrated by these bottlenecks, formed their own solutions, resulting in data silos and the growth of “shadow data IT”

Hadoop offered an alternative: it let teams use clusters of commodity machines to process massive datasets in parallel. This gave teams more flexibility than a centralized data warehouse architecture offered.

But Hadoop had multiple drawbacks of its own. First, it was hard to work with: query tools were limited, and resource management and performance remained a constant struggle. In addition, because teams ran their own Hadoop clusters, it made the shadow data IT problem worse; multiple teams were now managing their own unmonitored, unregulated data silos. It also pushed a lot of business logic out of data pipelines and into front-end BI tools.

ETL to ELT

As dbt Labs founder Tristan Handy has written before, the emergence of Amazon Redshift changed the game. Cloud computing eliminated the need for massive up-front capital expenditures for large-scale computing projects.

The availability of on-demand computing power also changed the way we processed data. Because cloud computing was so affordable, many teams could move from ETL processes to an ELT (Extract, Load, Transform) model that transformed unstructured and semi-structured data after loading it into a data warehouse. Because it leveraged the processing power of the target system and took a schema-on-read approach, ELT often proved more performant and more flexible.
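For contrast, here's the same toy example restructured as ELT: the raw records land in the warehouse untouched, and the cleanup happens afterward in SQL inside the warehouse, which is exactly the step where a transformation tool like dbt fits. As before, SQLite stands in for a cloud warehouse and the data is invented:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# --- Extract + Load: land the raw data as-is, bad rows and all
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.99", "us"), ("2", "N/A", "DE")],
)

# --- Transform: push the cleanup into the warehouse itself, after loading
warehouse.execute("""
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'        -- filter out rows that fail validation
""")
print(warehouse.execute("SELECT * FROM stg_orders").fetchall())
```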

From data warehouses to data lakes and data catalogs (where we are today)

Despite these advances, some of the old problems with data management lingered. Concepts such as version control, testing, and transparent data lineage were still mostly unheard of in data work. And data warehouses remained primarily storehouses for structured data, leaving unstructured and semi-structured data largely unmanaged.

The cloud didn’t solve the data silo problem; if anything it helped data silos proliferate. That made it harder than ever for data teams and business users to find the data they needed.

Data lakes addressed the first problem by providing low-cost storage for unstructured and semi-structured data.

Data catalogs addressed the second, becoming an increasingly popular way to search, tag, and classify data across the organization.

The architectural elements of a data mesh implementation

Data mesh is a decentralized approach to data management that enables individual teams to own their own data and associated pipelines. Done well, data mesh balances two competing yet important priorities: data democratization and data governance. It unblocks data domain teams from waiting on data engineering to implement their data pipelines for them, which enables faster time-to-market for data-driven products.

It also enables data engineering teams to be explicit about which datasets are “consumption ready” and, as a result, what standards they agree to meet for that data. Data mesh combines this independence with security and policy controls that keep data democracy from degenerating into data anarchy.

With data mesh architecture, teams leverage a shared infrastructure consisting of core data as well as tools for creating data pipelines, managing security and access, and enforcing corporate data governance policies across all data products. This architecture enables decentralization while ensuring data consistency, quality, and compliance. It also allows teams to leverage centralized functionality for managing data and data transformation pipelines without each team re-inventing the wheel.

Central services

The central services component of a data mesh architecture implements the technologies and processes required to enable a self-service data platform with federated computational (automated) governance. It’s further subdivided into two areas: management and discovery.

Management

Management includes functions for provisioning the software stacks needed to process and store data. These stacks form the data platform that domain teams then build on. Central services implements the automation that creates the resources a team needs to stand up a new stack.

Self-service data stacks include a standardized set of infrastructure that each team can leverage. This includes storage subsystems (object storage, databases, data warehouses, data lakes), data pipeline tools to import data from raw sources, and ELT tools such as dbt. They also include tools for creating versioned data contracts so that teams can register and expose their work to others as a reusable data product.
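What a self-service request looks like varies by platform, but as a rough, hypothetical sketch, a domain team could describe the stack it needs declaratively and hand it to the platform's provisioning automation. The StackRequest shape and provision_stack function below are illustrative placeholders, not a real API:

```python
from dataclasses import dataclass

@dataclass
class StackRequest:
    """Hypothetical request a domain team submits to the self-service platform."""
    domain: str                       # e.g. "finance"
    warehouse_schema: str             # isolated schema for the domain's models
    ingestion_sources: list[str]      # raw sources the pipeline tool should pull
    transformation_tool: str = "dbt"  # standardized ELT tooling
    object_storage_bucket: str | None = None

def provision_stack(request: StackRequest) -> dict:
    """Placeholder: a real platform would call Terraform, a cloud API, etc."""
    return {
        "domain": request.domain,
        "schema": request.warehouse_schema,
        "sources": request.ingestion_sources,
        "status": "provisioned",
    }

finance_stack = provision_stack(
    StackRequest(
        domain="finance",
        warehouse_schema="finance_analytics",
        ingestion_sources=["netsuite", "stripe"],
    )
)
print(finance_stack)
```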

Management also includes federated computational governance. It enforces access controls, provides tools for enabling and enforcing data classification for regulatory compliance, and enforces policies around data quality and other data governance standards. It also provides centralized monitoring, alerting, and metrics services for organizational data users.
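As a simplified, hypothetical illustration of computational governance, the snippet below encodes an access policy as code: the governance team defines once which roles may read data at each classification level, and the platform enforces that rule everywhere. The roles and classification levels are made up:

```python
from enum import Enum

class Classification(Enum):
    PUBLIC = 1
    INTERNAL = 2
    PII = 3

# Hypothetical federated policy: which roles may read data at each classification level.
POLICY = {
    Classification.PUBLIC:   {"analyst", "data_scientist", "engineer", "external_partner"},
    Classification.INTERNAL: {"analyst", "data_scientist", "engineer"},
    Classification.PII:      {"engineer"},
}

def can_read(role: str, classification: Classification) -> bool:
    """Return True if the role is allowed to read data at this classification."""
    return role in POLICY[classification]

assert can_read("analyst", Classification.INTERNAL)
assert not can_read("analyst", Classification.PII)
```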

Discovery

Because central services acts as a clearinghouse for managing an organization’s data, it also serves an important discovery function. Through a data catalog, users can search organizational data and find both raw data and data products that they can incorporate into their own datasets.
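Here's a toy sketch of that discovery function. The catalog below is just an in-memory list, whereas a real implementation would be a dedicated catalog service, but the search experience it illustrates is the same; the entries and tags are invented:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str
    owner_domain: str
    kind: str          # "raw" or "data_product"
    tags: set[str]

# A toy in-memory catalog; a real one would be a dedicated catalog service.
CATALOG = [
    CatalogEntry("raw_stripe_payments", "finance", "raw", {"payments", "pii"}),
    CatalogEntry("fct_monthly_revenue", "finance", "data_product", {"revenue", "finance"}),
    CatalogEntry("dim_customers", "sales", "data_product", {"customers", "crm"}),
]

def search(term: str) -> list[CatalogEntry]:
    """Find entries whose name or tags match the search term."""
    term = term.lower()
    return [e for e in CATALOG if term in e.name.lower() or term in e.tags]

for entry in search("revenue"):
    print(f"{entry.owner_domain}/{entry.name} ({entry.kind})")
```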

Producers (data domains)

The producers represent the collection of data domains owned by data teams. Architecturally, producers make use of the stacks provisioned for them by central services. They pull from one or (more often) many data sources through data pipelines to create new datasets.

The output of the producers’ work is one or more data products. A data product exposes a subset of the producers’ data that other teams can leverage in their work. Each data product may include a contract that specifies the structure of the data it exposes, as well as access policies that control who can see what data and code.
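As an illustrative sketch (not a real contract format from any particular tool), a data product contract can be modeled as a versioned schema plus an access policy, and conformance checks can run against it automatically:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool = True

@dataclass(frozen=True)
class DataContract:
    product: str
    version: str
    columns: tuple[Column, ...]
    allowed_roles: frozenset[str]   # access policy: who may query this product

monthly_revenue_v1 = DataContract(
    product="fct_monthly_revenue",
    version="1.0.0",
    columns=(
        Column("month", "date", nullable=False),
        Column("revenue_usd", "numeric", nullable=False),
        Column("region", "text"),
    ),
    allowed_roles=frozenset({"analyst", "finance_engineer"}),
)

def conforms(row: dict, contract: DataContract) -> bool:
    """Check that a row has exactly the contracted columns and no unexpected nulls."""
    names = {c.name for c in contract.columns}
    if set(row) != names:
        return False
    return all(row[c.name] is not None or c.nullable for c in contract.columns)

print(conforms({"month": "2024-01-01", "revenue_usd": 125000.0, "region": None},
               monthly_revenue_v1))   # True: region is nullable
```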

Consumers

Consumers take the output from the producers and use it to drive business decisions. Consumers can be salespeople or decision-makers developing BI reports, analytics engineers further refining data for data analytics, data scientists building machine learning or AI models, or others.

Additionally, producers and consumers often overlap. A team can be a consumer of one team’s data while also producing data that another team uses. Because every team publishes its output as data products that others can discover through the central data catalog, it’s easy for teams to build a web of connectivity between each other. This is what puts the “mesh” in “data mesh.”

How the pieces fit together

With all of these pieces in place, a workflow between these architectural components emerges. Data governance leaders (a combination of business stakeholders and members of the data platform team) define policies for data governance and data quality. Central services then implements support for self-service data products and federated computational governance, enforcing data governance policies via code.

From there, data producers use the self-service data platform to spin up a new stack for a new data product, using the data catalog supported by central services to find other data and data products along the way. Once ready, producers publish the initial version of their data product to the data catalog, where consumers and other producers can find it. Consumers then use data products either as end products in themselves (e.g., reports) or as inputs to another process (e.g., a machine learning model).

As data and business requirements evolve, data producers release new versions of their data products with new contracts to preserve backwards compatibility. Consumers and other producers receive alerts about the new version of the contract for the data product they’re consuming, and update their workflows to use this new contract before the previous version expires.
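A hypothetical sketch of how a platform might detect breaking changes when a producer publishes a new contract version: additive changes pass, while removed columns or changed types trigger alerts to downstream consumers. The schemas below are made up:

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two contract schemas ({column: data_type}) and list breaking changes."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column '{column}' was removed")
        elif new[column] != old_type:
            problems.append(f"column '{column}' changed type {old_type} -> {new[column]}")
    return problems

v1 = {"month": "date", "revenue_usd": "numeric", "region": "text"}
v2 = {"month": "date", "revenue_usd": "text", "currency": "text"}  # drops region, retypes revenue

for issue in breaking_changes(v1, v2):
    print("ALERT:", issue)   # a real platform would notify downstream consumers
```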

Centralized services and data governance leaders work to onboard more teams to the self-service data platform and use metrics on data quality and usage from the data catalog to measure progress towards KPIs.

What data mesh brings to modern data architecture

These components of data mesh all work together to bring a number of benefits to your existing data architecture.

Scalability

Scalability comes from two areas. First, it comes from the self-service data platform. In a monolithic data management architecture, employees who wanted a new report or dataset had to send a request to a central data engineering team, which inevitably resulted in large backlogs and delays. With a self-service platform, data producers can automatically obtain the capabilities they need to create new data products.

Second, scalability comes from the data producer layer itself. Each team (and possibly each data product) can request the computing resources it needs to store, transform, and analyze data. Each data domain, architecturally, runs as its own separate data processing hub.

Increased trust in data

The centralized services layer supports a data catalog that enables all data producers to register their data products and data sources in a single location. The data catalog serves as the single source of truth within the company. This enables producers to own their own data domains while the company enforces data quality and classification standards across all owners.

Via the data catalog and other data governance tools, the data governance team can quantify and track the quality of data across the company. For example, it can report statistics on the accuracy, consistency, and completeness of the data it monitors, as well as produce reports on how much of the company’s data is properly classified.
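As a toy example of the kind of metric a governance team might track, the sketch below computes column-level completeness (the share of non-null values) over a small, made-up customer table:

```python
def completeness(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Fraction of non-null values per column, 0.0-1.0."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if r.get(col) is not None) / total
        for col in columns
    }

customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": None,            "country": "DE"},
    {"id": 3, "email": "c@example.com", "country": None},
]

print(completeness(customers, ["id", "email", "country"]))
# {'id': 1.0, 'email': 0.667, 'country': 0.667} (approximately)
```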

Finally, because data domain teams own their own data, they can ensure that it’s kept up to date and that its structure reflects the changing realities of their business. All of these factors lead to an increased trust in data as companies move closer towards a data mesh architecture.

Greater reliability and reduced rework

The self-service data platform also helps enforce uniformity across data domains through the use of contracts for data products.

One of the primary sources of data issues is disruptions caused by sudden and unexpected changes in the format of data. By enforcing the need for data contracts on data products, producers can alert consumers to upcoming breaking changes. This improves reliability across the data ecosystem. That saves everyone involved time and further increases confidence and trust in the company’s data.

Faster deployment of data products

The increased scalability, increased trust in data, and greater reliability of data mean teams can bring new data products to market faster.

One of the largest obstacles to launching new data products is finding data you can trust. In a 2022 study by Enterprise Strategy Group, 46% of respondents said that identifying the quality of source data was an impediment to using data effectively. By building increased trust in data, companies can empower data domain teams to ship innovative ideas in less time.

Data mesh uses new and existing data management technologies to create a distributed, federated approach to managing data at scale. Understanding what each layer contains and how each one works together gives you a roadmap for transitioning to the next evolution of modern data architecture.

Join us at Coalesce October 7-10 to learn from data leaders about how they built scalable data mesh architectures at their organizations.

Last modified on: Oct 16, 2024
