The Analytics Development Lifecycle: Test
Dec 26, 2024
Data quality errors are your company’s worst enemy. At best, they undermine people’s trust in the data that drives the business. At worst, they can provide false information that leads to erroneous - and costly - business decisions.
The Analytics Development Lifecycle (ADLC) aims to create a mature analytics workflow that produces high-quality, frequently updated data with every iteration. A key part of delivering that quality is not just creating tests, but fostering a test-driven culture as part of the ADLC.
We’ll explore how testing fits into the ADLC, the types of tests you should be creating, and how to best manage your testing efforts for maximum positive impact.
Test in the ADLC
The ADLC is a variation of the Software Development Lifecycle (SDLC) that focuses on shipping new or revised data products. Like the SDLC, it breaks down artificial barriers between the different personas that deal with data, treating analytics development as a single, unified process.
In the ADLC, the different personas that handle data - the engineer, the analyst, and the decision-maker - work together to plan, develop, test, deploy, monitor, and use new data products. The process focuses on creating small, well-defined changes and shipping frequently.
We’ve covered how the Plan and Develop phases of the ADLC work. These phases help ensure quality by ensuring that:
- The work done accurately captures business requirements (Plan); and
- All data changes are captured in code, and that code is clean, readable, and reusable (Develop)
The Test phase creates assets that validate that your assumptions about your data and analytics code are correct before pushing a change to production. By testing your data, you can identify issues early in the development lifecycle, preventing expensive rework and downtime down the road.
A good Test phase involves:
- Writing tests for every data asset you own
- Running tests before changes are merged into production
- Continuously testing production data to detect anomalies
Types of tests in the ADLC
Let’s first look at the different types of data tests you’ll want to focus on writing:
- Unit tests
- Data tests
- Integration tests
Unit tests
Unit tests validate small functional portions of your data models and transformations to ensure correctness. They validate your logic on a small set of static inputs before running it on actual data. In data pipelines, this means validating your SQL modeling logic’s correctness.
dbt Cloud supports developing unit tests alongside your SQL models and running them on demand. You don’t need to create a test for every single transformation. However, you should always aim to create unit tests when you have:
- SQL with custom logic
- Reported defects (to verify the fix and prevent regressions)
- Edge cases
- High criticality models, such as organization data sets where a defect could have wide-scale negative impact
For example, this unit test in dbt checks that a routine for validating email addresses handles known edge cases, such as malformed addresses and invalid domain names:
unit_tests:
  - name: test_is_valid_email_address
    description: "Check my is_valid_email_address logic captures all known edge cases - emails without ., emails without @, and emails from invalid domains."
    model: dim_customers
    given:
      - input: ref('stg_customers')
        rows:
          - {email: cool@example.com, email_top_level_domain: example.com}
          - {email: cool@unknown.com, email_top_level_domain: unknown.com}
          - {email: badgmail.com, email_top_level_domain: gmail.com}
          - {email: missingdot@gmailcom, email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect:
      rows:
        - {email: cool@example.com, is_valid_email_address: true}
        - {email: cool@unknown.com, is_valid_email_address: false}
        - {email: badgmail.com, is_valid_email_address: false}
        - {email: missingdot@gmailcom, is_valid_email_address: false}
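Once a unit test is defined, you can run it on its own - for example, dbt test --select test_type:unit (available in dbt 1.8 and later) runs all unit tests against their static fixtures without touching real data.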
Data tests
Data tests validate that data transformations are running correctly against the actual data. They verify that:
- The data is current
- The model is sound
- The transformed data is accurate
Data tests usually start by testing basic assumptions about unique and non-null fields (e.g., primary keys), accepted values, and relationships between data. Once you’ve nailed those basics, you can move on to more proactive tests that verify freshness and look for domain-specific problems. For example, if a customer can only have one active subscription to a service, you might verify that no records violate that constraint.
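For instance, here’s a minimal sketch of those basic checks expressed as generic tests in a model’s YAML file (the dim_subscriptions model and its columns are illustrative):

version: 2
models:
  - name: dim_subscriptions
    columns:
      - name: subscription_id
        tests:
          - unique      # every subscription appears exactly once
          - not_null    # primary key is always populated
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'paused', 'canceled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id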
As with unit tests, you can specify these tests using dbt Cloud and run them with the dbt test command.
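For the subscription rule above, one approach is a singular data test: a SQL file in your tests/ directory that selects any rows violating the rule, so the test fails if the query returns results (the fct_subscriptions model name is an assumption):

-- tests/assert_one_active_subscription_per_customer.sql
-- Fails if any customer has more than one active subscription
select
    customer_id,
    count(*) as active_subscriptions
from {{ ref('fct_subscriptions') }}
where status = 'active'
group by customer_id
having count(*) > 1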
Integration tests
Whereas unit tests validate one small unit of functionality, integration tests exercise the entire application or project. They ensure your solution works end to end, not merely in isolation.
In the software world, this might involve calling a REST API and ensuring that the endpoint, associated authentication procedures, underlying data stores, connected APIs, and so on all work together. In data work, you’ll most often use integration tests to validate packages: reusable units of analytics code that multiple projects leverage.
In dbt, you can keep unit, data, and integration tests separate by placing them in separate subdirectories. That enables running them at different points of the ADLC.
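For example, assuming singular data tests live in tests/data and integration checks in tests/integration, you could target each group with dbt’s path selector:

# Run only the singular data tests
dbt test --select path:tests/data

# Run only the integration tests
dbt test --select path:tests/integration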
When to run tests
Anyone who’s creating or updating analytics code - i.e., who’s wearing the engineer hat - is responsible for creating or updating the associated tests. The engineer should make sure to run unit and data tests on their local machine prior to check-in.
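In practice, that local check can be as simple as running dbt build, which builds models and runs their associated tests together in dependency order, or dbt run followed by dbt test.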
As discussed in the Develop phase, engineers should work in their own source control branches. When ready to push to production, they should open a Pull Request (PR). Another engineer should review their changes - yet another quality control measure - before approving the merge.
The PR should also automatically trigger a run of any tests associated with the change against non-production data in an isolated environment. If any tests fail, the PR should block the merge to production until the issue is resolved.
dbt Cloud supports running tests automatically against a staging schema when it detects that a PR has been opened or updated in your Git provider. You can see the run in either the dbt Cloud dashboard or directly on the PR page of your Git provider, along with any errors that resulted.
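If you also run checks in your Git provider’s own CI rather than relying solely on dbt Cloud, a minimal GitHub Actions sketch might look like this (the adapter, credentials setup, and artifact path are all assumptions):

name: dbt-ci
on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The adapter is an assumption - install the one for your warehouse.
      # Warehouse credentials would come from repository secrets.
      - run: pip install dbt-core dbt-snowflake
      - run: dbt deps
      # "Slim CI": build and test only modified models and their children,
      # deferring unchanged upstream models to production artifacts.
      - run: dbt build --select state:modified+ --defer --state prod-artifacts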
Tips for managing testing
Here are a few more tips to get the most out of data testing:
Develop a culture of testing. It’s easy to throw testing by the wayside because you’re busy and you just want to get something out the door. As our CEO Tristan Handy has written, “The desire to skip writing good tests and move on to the next task is always present and must be balanced via accountability mechanisms like code reviews, linting, and test coverage metrics.”
Get everyone on board with testing as a matter of habit. Set a bar where testing is required for a change and enforce it during PR reviews so that team members hold each other accountable.
Keep the scope of work small. This is a central tenet of the ADLC, and it’s critical in testing: the larger a change, the harder it is to verify its functional correctness. Conduct training on properly scoping PRs so that every submitted change contains enough new logic to be useful, but not so much that you can’t verify its accuracy.
Determine your level of test coverage. Decide how much of your analytics code should require testing. In the software field, most teams aim for around 70-80% test coverage. You may need less depending on the complexity of your code.
Once you have a metric for test coverage, monitor it over time to ensure you’re hitting your goal. The Recommendations page in the dbt Cloud dashboard shows your overall test coverage as the percentage of your models that have tests defined.
Fix or retire “flaky tests.” A flaky test is one that fails intermittently, usually due to some network or environmental condition, or just poorly written logic. Ignoring flaky tests is dangerous because it can foster “alert fatigue,” leading people to tune out and ignore real errors. Either identify the cause of a flaky test and fix it or remove it from your test suite altogether.
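While you investigate, dbt also lets you downgrade a test’s severity so it warns rather than fails, keeping the signal visible without blocking runs (model and column names here are illustrative):

version: 2
models:
  - name: stg_events
    columns:
      - name: event_id
        tests:
          - unique:
              config:
                severity: warn  # surfaces the flake without failing the run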
Conclusion
The ADLC creates high-quality data sets by making small changes over a series of rapid iterations. Testing verifies quality by making assertions about the state of your data and analytics code.
Since its inception, dbt has fostered a test-driven culture by building support for testing directly into both dbt models and dbt Cloud. With dbt Cloud as your data control plane, your data teams have a standardized and cost-efficient way to build, test, deploy, and discover analytics code.
In our next installment of this series, we’ll look at how you can leverage dbt Cloud to implement a CI/CD-style approach to deploying analytics code safely to production.