
The Analytics Development Lifecycle: Operate and Observe

The Analytics Development Lifecycle (ADLC) is a methodology you can use to bring higher quality and repeatability to your analytics projects. With the ADLC, you can use a DataOps approach to ship data products to production more frequently, reducing expensive rework and errors.

In the previous installments of this series, we focused on the Data phase: how to create, test, and deploy new analytics code changes.

In this and the next article, we're moving on to the Ops phase: operationalizing your analytics code. Below, we'll dive into how the Operate and Observe phases ensure your code is available, performs well, and remains free of errors.

Practices in the Operate and Observe phase

In the past, data engineers often worked in isolation, deploying large analytics code changes in a haphazard fashion. The ADLC, patterned after the Software Development Lifecycle (SDLC), uses a DataOps approach to ship smaller changes in an iterative and incremental fashion.

Both sides, Data and Ops, are critical to any successful analytics project:

  • Data: Obtain stakeholder agreement, develop maintainable and reusable code, test your work, and deploy it automatically to staging and production
  • Ops: Observe code as it runs in production, alert and respond to errors, and ensure data is both discoverable and well-governed

The Operate and Observe phases ensure your code not only works, but continues to work as expected and remains performant over time. Best practices in these phases include:

  • Provide always-on analytics
  • Test in production
  • Catch errors before customers do
  • Tolerate and recover from failure
  • Choose your own metrics and measure them religiously
  • Don’t overshoot

Let’s dive into each of these in detail.

Provide always-on analytics

Older data systems frequently required data engineers to take them offline in order to batch import data. At the speed of business and volume of data we deal with today, however, this won't fly. Taking down a report or a data-driven application that supports critical business decisions or real-time tasks such as fraud detection could cost your business, and your customers, money.

Modern software systems are designed with an always-on architecture. Our data system should be, too.

For analytics, this usually means maintaining multiple copies of your data and replicating to secondaries as new data trickles into the primary. Any required downtime for an analytics system should be kept to a minimum and scheduled outside of core business hours.

Test in production

Earlier in our series, we emphasized the importance of building unit, data, and integration tests to vet analytics code changes before they make it into customers' hands. A multi-environment deployment process that verifies changes against test data reduces the risk of shipping data transformation code that results in incorrect data or downtime.

However, pre-production environments can never be exact replicas of production. To protect consumer privacy, pre-production data must be either mock data or a subset of real-world data that's been anonymized. Additionally, there are other data workloads running constantly in production that aren't running in pre-production during testing.

The bottom line is that you shouldn't just test in pre-production; you should be testing in production as well. Testing in production is a technique borrowed from software engineering in which you keep running tests against the live environment with real data. By testing within the context of other workloads and incoming, real-world data, you can uncover hidden issues, edge cases, and performance concerns.

Most major tech companies use some form of testing in production to validate and monitor their solutions post-deployment. Using dbt Cloud, you can schedule the tests you already wrote alongside your models to run in production as well.
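For instance, here's a minimal sketch of triggering a dedicated production test job through the dbt Cloud Administrative API. It assumes a job that runs your existing tests against the production environment, a service token stored in the DBT_CLOUD_API_TOKEN environment variable, and placeholder account and job IDs; check the API docs for the correct host for your dbt Cloud region.

```python
import os

import requests

# Placeholders for illustration only; substitute your own values.
DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"
ACCOUNT_ID = 12345          # hypothetical account ID
PROD_TEST_JOB_ID = 67890    # hypothetical job that runs `dbt test` in production


def trigger_production_tests() -> int:
    """Kick off the production test job and return the run ID."""
    response = requests.post(
        f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/jobs/{PROD_TEST_JOB_ID}/run/",
        headers={"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"},
        json={"cause": "Scheduled test-in-production run"},
    )
    response.raise_for_status()
    return response.json()["data"]["id"]


if __name__ == "__main__":
    print(f"Started production test run {trigger_production_tests()}")
```

You could call a script like this from your orchestrator after each production deployment, or simply let dbt Cloud's scheduler run the same job on a recurring cadence.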

Catch errors before customers do

Testing in production helps catch errors before they land in your stakeholders' laps. Whether it's a missing field value, an incorrect value, a relational cardinality violation, or some other data anomaly, catching issues before customers do reduces confusion and miscommunication.

Testing in production, however, is only one part of catching errors. You can catch more errors, more quickly, by implementing multiple safety measures, including:

  • Dashboards for observing metrics
  • Alerts the system can throw on any error it detects
  • Automatic incident creation
  • Tools - such as logging and data lineage - for analyzing issues and finding their root cause
  • A mature process around incident triage and response, including a distributed, around-the-clock staffing model and on-call rotation for support
  • Staff who have clear ownership of and appropriate training in resolving data issues quickly

Tools like dbt Cloud provide some of these functions - such as job notifications, model notifications, and webhooks for communicating failure status - out of the box. This lets you add reliability to your overall data control plane with little additional engineering overhead.
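As a rough sketch of what the receiving end of a webhook might look like, the example below accepts a dbt Cloud webhook and opens an incident when a job run fails. The payload fields (eventType, data) reflect dbt Cloud's webhook documentation at the time of writing, and create_incident() is a hypothetical stand-in for your incident-management integration; verify the payload schema and add signature verification before relying on anything like this.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def create_incident(title: str, details: dict) -> None:
    """Hypothetical stand-in for your incident-management integration."""
    print(f"INCIDENT: {title} -> {details}")


@app.route("/dbt-webhook", methods=["POST"])
def handle_dbt_webhook():
    # A production receiver should also verify the webhook's authorization
    # header against your webhook secret before trusting the payload.
    payload = request.get_json(force=True)
    event_type = payload.get("eventType", "")
    run_data = payload.get("data", {})

    # Open an incident only for failed runs; acknowledge everything else.
    if event_type == "job.run.errored":
        create_incident(
            title=f"dbt Cloud job failed: {run_data.get('jobName', 'unknown')}",
            details=run_data,
        )
    return jsonify({"status": "received"}), 200


if __name__ == "__main__":
    app.run(port=8080)
```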

Tolerate and recover from failure

Part of being “always-on” is not going down due to an undetected failure. As your analytics system matures, consider building in additional measures that detect errors, issue alerts, and restart processing once data engineers have deployed a fix.

dbt Cloud bakes this in by supporting job retries from the point of failure. You can trigger a retry from the dbt Cloud dashboard, from the command line, or automatically through an API endpoint. That last option lets you build more sophisticated error recovery mechanisms that retry a run after automated error resolution.
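As a rough sketch, assuming a service token in DBT_CLOUD_API_TOKEN and placeholder account and run IDs, a retry from the point of failure might be triggered through the Administrative API like this (confirm the exact endpoint path against the API reference for your dbt Cloud version):

```python
import os

import requests

DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"
ACCOUNT_ID = 12345  # placeholder account ID


def retry_run_from_failure(run_id: int) -> dict:
    """Restart a failed run, picking up from the step that failed."""
    response = requests.post(
        f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/runs/{run_id}/retry/",
        headers={"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"},
    )
    response.raise_for_status()
    return response.json()["data"]
```

Paired with the webhook receiver sketched above, this is enough to automatically retry runs that failed for a known transient reason, such as a warehouse connection timeout.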

Choose your own metrics and measure them religiously

There are a number of metrics to measure the general reliability of your data systems, including uptime, availability, latency, and throughput. There is also a wealth of possible metrics for assessing data quality, which you can use to improve reliability over time across your entire data estate.

It's important to define the metrics that matter to you and your team as part of the ADLC process; preferably, this is something you've done during the Plan phase. Definitions should include both how these metrics are calculated and their acceptable thresholds. It's also important not just to monitor them, but to centralize and document their definitions so they're accessible to all data stakeholders.
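To make that concrete, here's a minimal sketch of calculating one such metric, the success rate of a production job over its recent runs, and comparing it against an agreed threshold. The threshold, the account and job IDs, and the run fields (is_complete, is_success) are illustrative assumptions; substitute the definitions your team agreed on during the Plan phase and confirm field names against the dbt Cloud API reference.

```python
import os

import requests

DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"
ACCOUNT_ID = 12345              # placeholder account ID
PROD_JOB_ID = 67890             # placeholder production job ID
SUCCESS_RATE_THRESHOLD = 0.99   # example threshold agreed during planning


def fetch_recent_runs(limit: int = 100) -> list[dict]:
    """Pull the most recent runs for the production job."""
    response = requests.get(
        f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/runs/",
        headers={"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"},
        params={"job_definition_id": PROD_JOB_ID, "order_by": "-id", "limit": limit},
    )
    response.raise_for_status()
    return response.json()["data"]


def check_success_rate(runs: list[dict]) -> None:
    # `is_complete` and `is_success` are assumed run fields; verify them
    # against the API reference for your dbt Cloud version.
    finished = [r for r in runs if r.get("is_complete")]
    if not finished:
        return
    success_rate = sum(1 for r in finished if r.get("is_success")) / len(finished)
    print(f"Success rate over last {len(finished)} runs: {success_rate:.2%}")
    if success_rate < SUCCESS_RATE_THRESHOLD:
        print("Below the agreed threshold: raise an alert or open an incident")


if __name__ == "__main__":
    check_success_rate(fetch_recent_runs())
```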

You can also improve your overall data reliability by integrating with third-party tools that specialize in data quality. dbt Cloud integrates with multiple third-party platforms that provide advanced data quality monitoring services, such as ML-based anomaly detection, code impact analysis, and root cause analysis.

Don’t overshoot

Software engineers like to warn that we shouldn't let the perfect be the enemy of the good. Once you have general stability in your analytics systems, it can be tempting to try to eliminate every last issue.

This is almost always counterproductive. Adding an additional nine of reliability carries exponentially higher costs. This is encapsulated in the 10x9 Rule: for every nine you add, you increase reliability 10x - but at 10x the total cost of your solution. Moving from 99.9% to 99.99% availability, for example, shrinks your annual downtime budget from roughly 8.8 hours to about 53 minutes, and the engineering effort required grows accordingly.

Determine when you’ve achieved “enough” reliability and when you should rely on your systems and processes to resolve previously undetected errors quickly. Aim for progress, not perfection.

Conclusion

Ensuring quality doesn't end once you ship something to production. Constant testing and monitoring are required to verify your changes work as expected in a real-world context.

The good news is that this creates a virtuous cycle. As you encounter issues in the real world, you can anticipate and test for them in subsequent releases. This makes your analytics code higher quality and more resilient with every release.

In the final installment of the series, we'll look at how to make sure stakeholders can find and benefit from new data products, as well as provide feedback for future releases.


