Data quality dimensions: What they are and how to incorporate them
Jun 30, 2024
Poor-quality data can kill a data-driven project before it even starts. Unfortunately, data quality remains a struggle across all industries.
Eight years ago, the situation was dire. A Harvard Business Review study conducted in 2017 found that only 3% of corporate data met basic quality standards. While things have improved, Gartner still estimates that companies lose $12 million annually to poor-quality data.
Improving data quality requires a proactive approach, which in turn requires knowing how to measure data quality and what the trade-offs are. In this article, we'll look at what data quality dimensions are, why they're vital to improving data quality, and walk through the key dimensions along with some examples.
What are data quality dimensions?
A dimension is some aspect of your data that has meaning or value to your business.
Breaking down data quality into dimensions lets you determine which aspects are most important to your teams and to the company as a whole. With that breakdown, you can:
- Create a comprehensive framework for data quality
- Measure and establish a baseline for quality across teams so you can focus on the areas of greatest impact
- Compare data quality across teams using common terminology and definitions
Data quality dimensions comprise one part of an overall data quality framework that should also include data pipelines, data cleansing rules, data governance rules, data quality monitoring, and data quality tools.
What are the key data quality dimensions?
While there are many different ways to categorize data quality dimensions, one useful taxonomy is usefulness, accuracy, completeness, consistency, uniqueness, validity, and freshness.
Most taxonomies leave off usefulness. We view defining data's use and business value early on as a key part of the Analytics Development Lifecycle (ADLC). Other breakdowns may look at data quality dimensions from slightly different angles—there's no single way to think about this. What's important is to have a consistent and useful taxonomy.
Let's take a look at each category of data quality dimensions, along with some examples.
Usefulness
We feel people often leave usefulness out because it's the hardest category to define. You can define usefulness by answering the question: is the data generating value for the business?
One way to measure this is through the overall presence of dark data in your company. Dark data is all the data in your company that's lying dormant, unused. Sadly, for most companies, this is the majority of their assets. One survey by Splunk estimates as much as 55% of any company's data might be dark.
Dark data generates no revenue or business value. Worse, it loses money, as you have to pay for the compute and storage you use to transform and preserve the data.
You can identify existing dark data by tracking data usage statistics, for example by using a data catalog or your warehouse's query history (see the sketch after this list). You can then take one of two approaches with this data:
- Address any issues in data quality, discoverability, or documentation that led it to become dark in the first place; or
- Obsolete the data, shutting down its transformation pipelines and storing existing data in a cheaper form of cold storage (e.g., Amazon S3 Glacier storage classes).
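If your warehouse exposes query history, you can also approximate usage tracking in SQL. The query below is a minimal sketch that assumes a Snowflake warehouse and its account_usage views; the 90-day window and the decision to treat any unqueried table as dark are assumptions you'd adapt to your environment.
-- Hypothetical sketch: list tables that no query has touched in the last 90 days,
-- using Snowflake's account_usage views. Adapt the views, window, and filters
-- to your own warehouse and schemas.
with recent_access as (
    select distinct
        accessed.value:"objectName"::string as object_name
    from snowflake.account_usage.access_history as ah,
        lateral flatten(input => ah.direct_objects_accessed) as accessed
    where ah.query_start_time >= dateadd('day', -90, current_timestamp())
)
select
    t.table_catalog || '.' || t.table_schema || '.' || t.table_name as table_name
from snowflake.account_usage.tables as t
left join recent_access as ra
    on ra.object_name = t.table_catalog || '.' || t.table_schema || '.' || t.table_name
where t.deleted is null         -- the table still exists
  and ra.object_name is null    -- but nothing has queried it in the window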
Accuracy
Accuracy defines how closely a data result matches existing reality.
As a data practitioner, it's your job to ensure that the data you expose to end users is accurate, meaning it contains values that reflect reality. If your business sold 198 new subscriptions today, 198 new subscriptions should be represented in your raw data.
Accuracy issues can arise due to conflicting upstream data sources, out-of-date data, buggy analytics code, or other technical issues. You can identify inaccurate data currently in your system using techniques such as statistical analysis, data profiling, consistency checks, spot checking, and sampling.
The best approach, of course, is to prevent inaccuracies from entering your data in the first place. You can get closer to this ideal state by using a data transformation framework like dbt to implement data tests that verify the values in a sample data set are correct after transformation.
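For example, you can capture a handful of manually verified figures and compare them against your transformed model. The singular test below is a sketch: fct_subscriptions and the seed expected_daily_subscriptions are hypothetical names, not part of any standard project, and the test fails if any date's count doesn't match the verified value.
-- Hypothetical singular test: compare daily subscription counts in a transformed
-- model against a seed of manually verified values. Rows returned here are
-- mismatches, which cause the test to fail.
with actual as (
    select signup_date, count(*) as new_subscriptions
    from {{ ref('fct_subscriptions') }}
    group by 1
)
select
    expected.signup_date,
    expected.new_subscriptions as expected_count,
    actual.new_subscriptions as actual_count
from {{ ref('expected_daily_subscriptions') }} as expected
left join actual
    on actual.signup_date = expected.signup_date
where actual.new_subscriptions is null
   or actual.new_subscriptions != expected.new_subscriptions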
Completeness
Completeness means you have all the required records, and fields within those records, needed to answer a given set of business questions. Like usefulness, completeness is a dimension that should be defined during the planning stage of the Analytics Development Lifecycle.
As with accuracy, you can measure completeness using techniques such as data profiling, sampling, and attribute profiling.
You can also use tests to check for completeness in data transformation — e.g., checking that key fields don’t have null or empty values. If you implement tests in dbt, you can speed up this process by implementing generic data tests that data engineers can reuse across projects.
The following test, for example, verifies that a given value is not null. (This is just an example; dbt ships with four generic tests out of the box: not_null, unique, accepted_values, and relationships.)
{% test not_null(model, column_name) %}
select *
from {{ model }}
where {{ column_name }} is null
{% endtest %}
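You can take the same approach to record-level completeness by comparing row counts between a raw source and the model built from it. The singular test below is a sketch that assumes a source defined as ('jaffle_shop', 'orders') and a staging model named stg_orders; both names are illustrative.
-- Hypothetical singular test: fail if records were dropped between a raw source
-- and its staging model.
with source_rows as (
    select count(*) as n from {{ source('jaffle_shop', 'orders') }}
),
model_rows as (
    select count(*) as n from {{ ref('stg_orders') }}
)
select
    source_rows.n as source_count,
    model_rows.n as model_count
from source_rows
cross join model_rows
where model_rows.n < source_rows.n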
Consistency
Consistency measures whether data is self-consistent and whether it remains consistent across upstream and downstream data sources as it passes through the data lifecycle. Data consistency may also be called data integrity.
Inconsistency can have severe negative consequences, depending on the data. Consider, for example, patient medical systems. An inconsistency across systems in something like a patient ID can result in overlapping or missing records, such as a patient's list of allergies or currently prescribed medications. A decision made with such data could put someone's life at risk.
You can use data transformation tools like dbt to track and improve consistency. dbt is built upon the concept of a DAG, or directed acyclic graph. When someone changes an upstream model (say, a condition in a case when statement changes, producing a new value for a user_bucket column), you can rebuild your tables and propagate that new value to every downstream use of user_bucket, thanks to dbt's ref() function.
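As a minimal sketch, a downstream model only needs to select from its upstream model through ref() for dbt to place it downstream in the DAG and rebuild it whenever that upstream logic changes. The model names stg_users and dim_users here are hypothetical.
-- Hypothetical downstream model (e.g., models/marts/dim_users.sql). Because it
-- reads from the upstream model via ref(), any change to the user_bucket logic
-- upstream flows through the next time dbt builds the project.
select
    user_id,
    user_bucket,
    signup_date
from {{ ref('stg_users') }}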
dbt also supports the use of metrics via its Semantic Layer, allowing you to define a core metric once (in a version-controlled, code-based environment). You can then expose that same core metric consistently across your BI and analytics tools. At the end of the month, the CFO and COO shouldn't have two different values for ARR. With a semantic layer approach to metrics, you can govern and maintain key KPIs in a way that creates unparalleled consistency.
Uniqueness
Uniqueness ensures that data is non-duplicative. It prevents the havoc caused by having multiple records for the same entity, each with slightly different information.
You can test for uniqueness primarily by defining good primary keys and enforcing unique values. However, the challenge is maintaining uniqueness across different data systems. A data transformation system like dbt can help with this by encouraging the reuse of critical data models across projects.
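In dbt, the simplest route is the built-in unique generic test applied to a model's primary key column. The singular test below spells out the equivalent SQL for a hypothetical dim_customers model.
-- Hypothetical singular test: any customer_id that appears more than once is
-- returned, which fails the test.
select
    customer_id,
    count(*) as n_records
from {{ ref('dim_customers') }}
group by 1
having count(*) > 1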
Validity
Validity ensures that a data value is correct for its column type. For some values, this can mean the value falls within an accepted range; for example, an integer representing a month should only take values between 1 and 12. For string values, it might require checking the format of structured data, such as verifying a field contains a properly formatted ZIP code, well-formed JSON, or a valid GUID.
Validity issues can also span multiple fields. For example, a person marked as living probably shouldn't have a birth date more than 120 years in the past.
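A cross-field rule like that can be written as a singular dbt test. The sketch below assumes a dim_persons model with is_deceased and birth_date columns, both hypothetical, and uses warehouse-specific date functions you'd adjust as needed.
-- Hypothetical cross-field validity test: flag living persons whose birth date
-- is more than 120 years in the past. Returned rows fail the test.
select
    person_id,
    birth_date
from {{ ref('dim_persons') }}
where not is_deceased
  and birth_date < dateadd('year', -120, current_date)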
You can also use accepted-value tests to ensure invalid values never enter your data stores in the first place. For example, the following dbt test checks that an order's total payment amount, refunds included, is never negative:
-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where total_amount < 0 to make the test fail.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('fct_payments') }}
group by 1
having total_amount < 0
To detect validity issues in existing data, you can run data audits that apply these same checks to historical data and calculate a validity score. As with all data issues, engineers can then use data lineage to find and fix these issues at their upstream source.
Freshness
Also known as timeliness, freshness measures whether data has been updated within a target timeframe.
The "acceptable" Service Level Agreement (SLA) for freshness will differ depending on the use case. For example, if new sales data only needs to land in a data warehouse table to fuel a weekly sales report, a one-day SLA will suffice. By contrast, an open product order should update whenever its status changes; since this can happen several times a day, you'll likely need a one-hour SLA.
You can measure freshness directly (how recently data was updated) and via data latency (how long it takes for an upstream change to propagate downstream). You can implement data freshness reports easily in dbt Cloud by enabling source freshness checks on your data transformation jobs.
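If you want a freshness check inside the dbt project itself, you can also express the SLA as a singular test. The sketch below assumes an fct_orders model with an updated_at column and a one-hour SLA, and uses warehouse-specific date functions; dbt's built-in source freshness checks, configured on your sources, are the more common route.
-- Hypothetical freshness test: fail if the newest record in fct_orders is more
-- than one hour old.
select
    max(updated_at) as most_recent_update
from {{ ref('fct_orders') }}
having max(updated_at) < dateadd('hour', -1, current_timestamp())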
Creating high-quality data across your enterprise
Defining data quality dimensions and associated metrics can give you a sense of where your data quality weak points are. With these metrics in hand, you can identify the improvements that will have the most immediate impact on your overall data quality.
The challenge is implementing this new approach to data quality consistently. In most companies, every team takes its own ad hoc approach to data transformation. That results in inconsistencies in quality from one project to the next.
dbt Cloud is your data control plane. It provides a single, uniform, and vendor-agnostic approach to data transformation you can use to model, test, and verify data across your data ecosystem, no matter where it lives. Using dbt Cloud, you can give teams an easy-to-use toolset for guaranteeing high-quality data without locking yourself into a specific vendor or data architecture.
Learn more about how dbt Cloud can improve data quality by contacting us for a demo today.
Last modified on: Mar 10, 2025