Data quality metrics: What to pay attention to
Jun 14, 2024
Data quality is the cornerstone of an organization’s data systems. Low-quality data undermines trust, leaving employees feeling lost and rudderless. By contrast, high-quality data fosters trust, encourages quick and confident decision-making, and helps drive new data projects that increase business value.
Data quality metrics provide a quantitative measure of data quality, enabling your organization to identify gaps and improve quality over time. However, defining a framework for data quality takes time—and requires changing how your organization works with data on a daily basis.
We’ll look at the dimensions of data quality, the best metrics for measuring it, and the tools and processes you can use to measure, improve, and maintain the quality of your data.
Dimensions of data quality
Researchers define data quality according to several different factors, or dimensions. These can include the following categories:
Accuracy
A combination of the data’s correctness (i.e., is it free of error? Does it correspond to reality?), believability, and completeness.
Freshness
A measure of data timeliness, or whether the data is recent enough for a given business purpose.
Usability
A combination of understandability, interpretability, and accessibility.
Security and compliance
These measures can include whether the data is available only to authorized users, how much data is tagged with sensitivity labels, and whether that tagging is accurate.
What matters in data quality
The above is a brief summary of the dimensions you might use when measuring data quality. In addition, your team may face trade-offs when defining key performance indicators for specific metrics.
For example, setting a very tight timeliness target (like minutes or seconds) might mean you can’t achieve the results you want for the accuracy or completeness of your data. Or, your attempts to make data more accessible and usable across the organization may conflict with your need to keep certain sensitive data sets secured against unauthorized access.
The nature of your data and its business use will drive not just which metrics you need but how much importance you attach to them. For example, when Rocket Money needed a quote-to-cash system to ensure the accuracy of their financial data, they placed additional weight on accuracy as a metric (and spent more time in testing to guarantee it). The team had an obligation not just to ensure the numbers were accurate, but to prove they were.
Measuring data quality
While dimensions help to identify data quality attributes, they’re not metrics. To use these dimensions as a data quality framework, you first need to identify the associated metrics you want to capture.
Here’s a non-exhaustive list of example metrics, grouped by the dimensions they support. You may use these alongside other metrics to get a full picture of a given data set's quality and overall value to the business (a SQL sketch after the list shows how one of them might be computed):
- Metrics related to data incidents such as total number of data incidents, time to detection, time to resolution, table health (number of incidents/table) (Accuracy)
- Number of empty/incomplete values, data transformation error rates (Accuracy, Completeness)
- Hours since last data refresh, data ingestion delay, tables with the most recent/oldest data, min/max/average data delays (Timeliness)
- Number of tests passed/failed over time (Accuracy, Completeness)
- Data importance score, number of users of a given table/asset/query, percentage of “dark” or unused data (Accessibility, Believability, Interpretability, Understandability)
- Dashboard uptime, table uptime, data storage costs, time to value (Accessibility)
- Number of assets tagged for sensitivity levels, number of data-related security incidents (Security)
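As one illustration, a timeliness metric like “hours since last data refresh” can often be computed directly in the warehouse. The sketch below is a minimal example, assuming a hypothetical raw.orders table with a loaded_at timestamp column and Snowflake-style date functions; adapt the names and dialect to your own environment.

```sql
-- Minimal freshness metric: hours since the most recent row landed in a
-- hypothetical raw.orders table (Snowflake-style DATEDIFF shown here).
select
    max(loaded_at)                                        as last_refresh_at,
    datediff('hour', max(loaded_at), current_timestamp()) as hours_since_last_refresh
from raw.orders
```

Scheduling a query like this across your most important tables gives you the min/max/average delay figures listed above.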
Once you’ve identified the metrics you want to capture, you need some way to hold yourself and the organization accountable for meeting them. This requires defining key performance indicators (KPIs) for your data and then verifying them through data quality testing.
For example, for an empty/incomplete values test, you could set a threshold at which point a record is considered incomplete—e.g., if 10% of its values are missing. This value may fluctuate depending on your use case. A voluntary customer survey may have 50% or more values missing and still contain worthwhile data. On the other hand, critical financial data could be considered incomplete if even 2% of its total values are missing.
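To make that threshold concrete, here is a minimal sketch of a record-level completeness check in SQL. The customer_surveys table and its columns are hypothetical, and the 10% threshold is only illustrative; the point is the pattern of counting missing values per record and comparing against an agreed limit.

```sql
-- Flag survey records where more than 10% of the tracked columns are missing.
-- Table, columns, and threshold are illustrative placeholders.
with scored as (
    select
        survey_id,
        (
            (case when respondent_email   is null then 1 else 0 end) +
            (case when satisfaction_score is null then 1 else 0 end) +
            (case when signup_channel     is null then 1 else 0 end) +
            (case when region             is null then 1 else 0 end)
        ) / 4.0 as missing_ratio
    from customer_surveys
)

select
    survey_id,
    missing_ratio
from scored
where missing_ratio > 0.10  -- tighten or relax to fit the use case
```

Aggregating missing_ratio across the table turns the same logic into a data-set-level completeness metric.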
Data quality tools
Implementing a data quality metrics framework that works requires more than defining metrics and writing tests. It requires both creating a data quality culture and adopting the right technology and tools.
A data quality culture at the organizational level ensures that all teams are aligned on the connection between value and quality. This is usually laid out as a set of definitions and principles—e.g., “we prioritize accuracy in order to maintain client trust.” A set of common KPIs, along with associated definitions of data quality, is also crucial to ensure everyone is assessing and measuring quality in the same way.
Another component of a data quality culture is making data quality an integral part of everyone’s daily workflow. This can include measures such as requiring comprehensive tests for every new data set, setting standards and goals for data classification, providing publicly available data quality dashboards, and streamlining the process for reporting and resolving data errors.
Poor data quality at any point in a data set’s lifecycle can cause downstream failures in reports and data-driven applications. Such issues undermine everyone’s trust in the data. Over time, that decreases its use and its value. A data quality culture emphasizes that everyone who works with data is responsible for ensuring its accuracy and usefulness.
Technology and tools are also critical in ensuring data quality. There are three ways in which tools help drive data quality:
Data catalogs
Tools such as a data catalog enable everyone in a company to find what they need by serving as a single source of truth for an organization’s data. Data catalogs leverage metadata (such as ownership, last modified date, data types, and descriptions) to help users both find and understand data.
Data lineage
Data lineage shows how data moves through an organization. It shows business users where a given piece of data comes from so they can have confidence in the data’s origins and accuracy. It also enables data engineers to find and resolve data quality errors at their source.
Data testing
Data quality tests ensure that extracted and transformed data conforms to business users’ expectations and understanding of how the data should look. A robust data testing framework should run tests automatically every time new data is imported into the system to ensure it’s complete and error-free.
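As a simple illustration of the kind of check such a framework might run on every load, the sketch below reconciles row counts between a hypothetical raw.orders source table and the analytics.stg_orders table transformed from it; any returned row means records were dropped or duplicated along the way.

```sql
-- Reconciliation check: the transformed table should have one row per source
-- order. A non-empty result signals dropped or duplicated records.
with source_count as (
    select count(*) as n from raw.orders
),

transformed_count as (
    select count(*) as n from analytics.stg_orders
)

select
    source_count.n      as source_rows,
    transformed_count.n as transformed_rows
from source_count
cross join transformed_count
where source_count.n <> transformed_count.n
```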
Data quality with dbt
dbt is a data transformation tool that provides a framework for modeling, transforming, and storing data. With dbt, you can define data models that combine data from various sources into useful data products that business stakeholders can use to drive decision-making.
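For instance, a dbt model is simply a SELECT statement saved as a .sql file in the project; dbt materializes it as a table or view in your warehouse and infers lineage from its ref() calls. The staging models referenced below (stg_customers, stg_orders) are hypothetical, but the pattern is the standard dbt idiom.

```sql
-- models/customer_orders.sql: a hypothetical model combining two staging models.
-- dbt resolves each ref() to the upstream model and records the dependency in
-- the project's lineage graph.
select
    customers.customer_id,
    customers.customer_name,
    count(orders.order_id)  as order_count,
    sum(orders.order_total) as lifetime_value
from {{ ref('stg_customers') }} as customers
left join {{ ref('stg_orders') }} as orders
    on orders.customer_id = customers.customer_id
group by 1, 2
```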
At dbt Labs, we’ve long recognized how critical it is to maintain high-quality data sets. That’s why we’ve built multiple mechanisms for verifying quality with every data import. These include:
Test framework
Using dbt, you can define data tests to accompany your models using a combination of SQL and Jinja templating. You can define singular data tests, which check a specific table and set of records, as well as generic data tests, which implement general checks applicable to multiple data sets (e.g., a not-null test for a column in a table).
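As a rough sketch of both flavors (model, column, and test names here are hypothetical): a singular test is a SQL file in the tests/ directory that returns the rows violating an expectation, and the test fails if any rows come back.

```sql
-- tests/assert_no_negative_payment_amounts.sql (singular test)
-- Fails if any payment has a negative amount.
select *
from {{ ref('fct_payments') }}
where amount < 0
```

A generic test wraps a parameterized check in a Jinja test block so it can be applied to any model and column from your YAML configuration. The is_recent test below, with its Snowflake-style dateadd call, is an illustrative example rather than a built-in.

```sql
-- tests/generic/is_recent.sql (generic test)
-- Apply to a timestamp column in YAML, e.g. `- is_recent: {max_age_hours: 48}`.
{% test is_recent(model, column_name, max_age_hours=24) %}

select *
from {{ model }}
where {{ column_name }} < dateadd('hour', -{{ max_age_hours }}, current_timestamp())

{% endtest %}
```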
Continuous Integration
Preventing errors from creeping into your data requires continuous testing. Using dbt Cloud’s Continuous Integration (CI) feature, you can configure a CI pipeline that re-tests your data models with every code change.
A commit pushed to source control or an update to a pull request (PR) triggers a build that automatically runs all of the relevant tests you’ve written in a staging environment. If the staging build completes without error and your tests all pass, you can promote the change to production, running your tests again to verify you haven’t introduced any new errors into production data. If you do detect errors, you can roll back your changes to the previous version of your model while you identify the root cause.
Monitoring and alerting
You can use dbt Cloud to monitor all of your data transformation jobs. You can raise alerts on failure, monitor source freshness, retry failed jobs, and send notifications and alerts on both successful and failed job runs.
Conclusion
The metrics you use to define and measure data quality will be specific to your organization—and will likely even differ between teams and projects. No matter the metrics you select, establishing a data quality culture and selecting the right tools to measure data quality are essential to building a robust, repeatable process that makes data quality an organizational priority.
See how tools like dbt Cloud can help you create a data quality culture—contact us for a demo today.