DevOps emerged years ago as a methodology that aimed to break down the silos that had grown between software engineers and IT engineers. The goal was to accelerate the pace of software deployments through shared ownership and automation.
DevOps is now recognized across the software development industry as a success story. However, the data industry has remained far behind its software development counterparts in embracing a DevOps mindset. Too often, the people who create data pipelines and data models and the people who consume that data lack the tools or visibility they need to collaborate quickly and at scale. This results in time-consuming and expensive rework that delays the shipment of new data products and impairs trust in data and data teams.
The good news is, it doesn’t have to be this way. In this article, we’ll look at the DataOps development pattern, how it’s modeled on DevOps, the value it offers, and some of the tools you can use to implement it.
The DevOps blueprint
Before DevOps, most software engineering projects followed a waterfall model. Requirements, design, implementation, testing, and deployment all took place in strict sequential phases. The teams responsible for each phase often worked in isolation from one another.
In addition to significantly bogging down release timelines, this led to chaos when it came time to ship software. Software engineering teams would build apps that they “threw over the wall” to the IT department to manage and maintain. IT would then struggle to operationalize the application—fixing configuration issues, resolving dependency conflicts, and ensuring everything performed well at scale. This resulted in back-and-forth cycles between the two teams that could delay deployment for weeks or even months.
DevOps (a portmanteau of “development” and “operations”) solved this problem via three major changes:
- People: Software engineers, IT engineers, and other project members work together to design, build, test, deploy, and maintain software. Project stakeholders are also included to ensure requirements align with business needs.
- Processes: Instead of a waterfall approach, teams adopt an Agile methodology, building and releasing smaller iterations of a software system in tight two- to four-week cycles, or sprints.
- Tools: Development and operations teams build automated, tested deployment pipelines, making it easy to stand up a full application environment with a simple pull request.
This approach shortened development timelines by reducing friction between teams, promoting shared responsibility for software’s performance in production, and breaking work up into smaller components. Together, collaboration, shorter release cycles, and deployment automation accelerate time to market and reduce costly, time-consuming errors.
What is DataOps?
DataOps is a framework for managing data that removes silos between data producers (the creators of data products) and data consumers (the users of data products). In DataOps, data producers work closely with data consumers in short, rapid deployment cycles to design, develop, deploy, observe, and maintain new data products that align closely with data consumers’ evolving needs and business goals.
DataOps recognizes that a successful data project requires the joint expertise of both data producers and data consumers. Data producers are experts in data technology: integrating and transforming data, storing data efficiently, securing access, and so on. Data consumers, on the other hand, are experts in what they need from the data and how they can use it to execute winning business strategies.
For example, assume that a Finance team needs a new set of data to drive reports tracking sales trends. In a pre-DataOps world, they might log a request to a data engineering team, which then delivers a final product based on the limited information in the support ticket. The Finance team discovers that the data is missing important inputs or parameters, or that it varies greatly from the metrics in a similar data pull last month. They don’t trust the data and won’t use it, so they send the request back for fixes.
This repeats over days or weeks. Meanwhile, the Finance team is left without the reporting it needs to drive key business decisions.
In a DataOps approach, the Finance team and data engineers would meet to discuss the Finance team’s needs in detail. This would include the data required, its shape, the format and correct calculation of fields, allowable values, and, importantly, how the Finance team intends to use the data outputs to make strategic decisions.
The data engineering team would then develop a data product that pulls in all of the correct sources, models the data appropriately, and tests and documents the new data models. As in DevOps, the team would use automated deployment tooling such as CI/CD to verify, test, and release the new data product to the Finance team.
From that point, the Finance team can use the new data product as part of its daily workflows. At any time, they can communicate change requests back to the data engineering team—e.g., adding a new data source, or changing the format of certain data.
In a DataOps model, the Finance team is assured they have an efficient communications channel for logging these issues—and that the data team will respond quickly with new iterations of the data set. Similarly, the data engineering team has the tooling required to handle these change requests without feeling overwhelmed.
As in DevOps, the teams then repeat this cycle rapidly, continuously delivering improvements in line with business needs.
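One way to make the requirements from the Finance example concrete (required fields, their types, and allowable values) is to write them down as a small, executable data contract. The sketch below is illustrative only: the field names (order_date, region, net_revenue), the allowed region codes, and the FieldRule and validate_row helpers are all hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class FieldRule:
    """One requirement agreed between data producers and data consumers."""
    name: str
    dtype: type
    allowed: Optional[FrozenSet[Any]] = None  # None: any value of dtype is accepted


# Hypothetical contract for the Finance team's sales-trend data set.
SALES_CONTRACT: Tuple[FieldRule, ...] = (
    FieldRule("order_date", date),
    FieldRule("region", str, frozenset({"AMER", "EMEA", "APAC"})),
    FieldRule("net_revenue", float),
)


def validate_row(row: Dict[str, Any],
                 contract: Tuple[FieldRule, ...] = SALES_CONTRACT) -> List[str]:
    """Return human-readable violations; an empty list means the row conforms."""
    problems: List[str] = []
    for rule in contract:
        if rule.name not in row:
            problems.append(f"missing field: {rule.name}")
            continue
        value = row[rule.name]
        if not isinstance(value, rule.dtype):
            problems.append(
                f"{rule.name}: expected {rule.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
        elif rule.allowed is not None and value not in rule.allowed:
            problems.append(f"{rule.name}: value {value!r} not allowed")
    return problems
```

Running a check like validate_row over a sample of rows during review gives both teams a shared, testable definition of what "correct" means before the data product ships.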
Benefits of DataOps
There are a number of benefits to a DataOps approach—for both data product development and for the organization as a whole.
Development value of DataOps. Like DevOps, DataOps accelerates data development timelines. It does this by focusing on business value from the beginning and aligning teams from different domains. It also gives data teams the tooling and resources they need to translate raw data into actionable insights in a way that's automated, modular, and tested. Finally, it streamlines development by scoping work into smaller cycles that deliver value with every release.
Organizational value of DataOps. Organizationally, DataOps erodes the silos between data teams and their business stakeholders. This increases collaboration and knowledge sharing, resulting in a better final product and more strategic business decision making. The rapid releases and automated deployments associated with DataOps reduce bottlenecks, delivering more business value in less time.
DataOps tools
In a pre-DataOps workflow, engineers relied heavily on mechanisms such as stored procedures to drive data transformation pipelines. This approach can’t scale to handle today’s volume of data. It also lacks mechanisms to enable collaboration and guide standardization across the enterprise.
By contrast, DataOps relies on centralized tooling and automation to deliver reliable analytics code. This requires using a number of tools in concert to collaborate on data modeling and analytics code, standardize and centralize shared libraries, test data, version code, automate deployments, and observe the results.
Plan: In the planning phase, teams may use tools to help define centralized SLAs for data. They may also use tooling to track metrics such as data freshness or define a data mesh architecture to enable domain teams to work independently on their own data.
Build: Data engineers will typically use command-line tools such as Git for version control, as well as tools such as the dbt Cloud CLI (which includes a VS Code extension) or the dbt Cloud IDE for developing data models and transformation logic. Platforms like dbt Cloud provide a standardized model for building, using, and sharing data models.
Deploy: Deployment is often driven through tools like GitHub, where data engineers will create pull requests (PRs) for code changes that other engineers review prior to merging into production. The data engineering team can use tools such as GitHub Actions or dbt Cloud Continuous Integration (CI) support to validate and test code changes and catch breaking changes or unexpected behavior before new data is delivered downstream.
Monitor: The data team can use standard monitoring tools to ensure acceptable query performance and data quality, and to track the usage and adoption of a new data product across the organization. The team can also run its data tests continuously in production to monitor proactively for data issues before they negatively impact downstream consumers.
Catalog: The organization can explore, optimize, and leverage existing data models and data products (reports, visualizations, applications, etc.) in a centralized data catalog. A data catalog helps eliminate data silos by providing a single source of truth for finding, using, securing, and governing an organization’s data. Data catalogs enable data consumers to self-service answers to their own questions about data through features such as documentation, metadata, and data lineage.
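To illustrate the kinds of automated checks mentioned in the Plan, Deploy, and Monitor phases, here is a minimal sketch of a freshness check against an SLA and two simple data tests (no nulls, no duplicates). The function names, the row format (a list of dictionaries), and the thresholds are assumptions made for illustration. Tools such as dbt provide their own built-in equivalents, and this is not their API.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List


def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the most recent load falls within the agreed freshness SLA."""
    return now - last_loaded_at <= max_age


def not_null_failures(rows: Iterable[Dict[str, Any]], column: str) -> List[Dict[str, Any]]:
    """Return every row whose value for `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: Iterable[Dict[str, Any]], column: str) -> List[Any]:
    """Return each value of `column` that appears more than once."""
    counts = Counter(row.get(column) for row in rows)
    return [value for value, n in counts.items() if n > 1]
```

A CI job could run checks like these against a staging build of a pull request and block the merge on any failure. The same checks, scheduled against production, act as ongoing monitoring.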
Conclusion
For too long, data producers and consumers have worked in separate silos. This lack of communication can slow down—and even doom—many data projects. DataOps can break down this barrier, resulting in higher quality, faster delivery, and better business value.
dbt Cloud provides multiple features that streamline support for DataOps, enabling companies to harness the full value of their data assets. Contact us for a demo to see how dbt can help bring the power of DataOps to your organization.
Last modified on: Oct 15, 2024