dbt
Blog Data modeling techniques for modern data warehouses

Data modeling techniques for modern data warehouses

Apr 17, 2023

Learn

I remember the first time I showed a business user “under the hood” of the dbt project for our organization:

Them: “Wow! There are so many models and there’s so much work here; I would have never known that because of how intuitive the data is to use in Looker.”

Me: “😅😅...I know”

These data models formed the heart of analytics at our organization—powered every single dashboard in our BI tool— and the majority of business users at that time thought we were conducting magic behind the scenes to get the data in the proper state. The validation of the work and energy that went into those data models from my colleagues who depended on that data was gratifying.

What is data modeling?

Despite how it may appear, data modeling isn't some fantastical process made up of *waves hands airily* magic 🪄. Formally, it’s the process data practitioners use to make raw data into meaningful business entities in their data warehouse.

There’s a lot of SQL (and pointedly not magic) involved during the transformation process. Data teams rely on tried and true techniques to create predictable and scalable data models.

Data modeling shapes raw data into the story of a business, as well as establishes a repeatable process that can help create consistency in a data warehouse: how schemas and tables are structured, models are named, and relationships are constructed. At the end of the day, a solid data modeling process will produce a data warehouse that's navigable and intuitive, with data models that represent the needs of the business.

Data modeling techniques

Data modeling…the perfect balance between art and science, chaos and structure, and pain and joy 😂. At the end of the day, there are many ways to model your data and get it in a place that’s best suited for your business needs. Your downstream use cases, data warehouse, raw source data, and the skills of your data team will all help determine which data modeling approach is most appropriate for your business.

This page is going to cover some of the most common types of data modeling techniques we see used by modern analytics teams (relational, dimensional, entity-relationship, and data vault models), what they are at a high level, and how to unpack which one is most appropriate for your organization.

Relational model

A relational data model is a broad way to represent data and their relationships to other data. In a simple relational model, data is stored in a tabular format (think rows and columns) and connects to other tables via foreign keys.

At the end of the day, almost all of the data modeling techniques we discuss here and see in the real world are a type of relational model. These models are the building blocks of most modeling techniques.

This image depicts a simple relational model between customers, orders, and payments

Relational models themselves don’t necessarily give data teams the best way to scale their data models because of their focus on relationships and not business logic or use cases. Instead, they offer a foundation to build more robust relational modeling processes, such as dimensional modeling, to establish data models that scale in an organized and repeatable manner.

Good fit for:

  • Low-volume data and a small number of data sources
  • Simple use cases for the data
  • Low-medium concern around data warehousing costs (because relational models often need joining to be made meaningful)

Dimensional data model

Dimensional data modeling is a type of relational model that puts entities into two buckets of facts and dimensions. ‌Dimensional modeling is one of the most predominant models used in modern data stacks, as it offers a unique combination of flexibility and constraint.

A fact is a collection of information about an action, event, or result of a business process. Facts typically describe the verbs of your business (ex., account creations, payments, and email events).

A dimension, on the other hand, describes who or what took that action (think: users, accounts, customers, and invoices—the nouns of your business). Together, fact and dimension models form the tables that describe many modern businesses and offer an intuitive way to design the entities in your data warehouse.

Before the rise and wide adoption of cloud data warehouses—where storage is cheap and compute costs become the larger concern—dimensional models were often designed into star or snowflake schemas to concretely define fact and dimension tables. This clean breakup of models ‌aimed to reduce the cost of large, wide tables with many columns.

An image of a simplified star schema

Good fit for:

  • Medium-to-large number of data sources
  • Centralized data teams
  • End-use case for data is primarily around business intelligence and providing insights
  • Teams that want to create an easily navigable and predictable data warehouse design

Read more here on how we like to structure our dbt projects and data models.

Entity-relationship data model

Entity-relationship (ER) data models have entities at the heart of their modeling. ER modeling is a high-level data modeling technique that's based on how entities, their relationships, and attributes connect:

  • Entities: Typically represent core business objects, events, or functions; some examples of an entity could be customers, accounts, and products.
  • Attributes: The columns that describe an entity.
  • Relationships: The actions or relationships that join entities together. For example, customer and product entities can be joined by a purchase action.
An image of a simplified ER model between customers and products

The focus of ER modeling is often around the processes of a business. Relational models can connect any which way, but ER models are meant to form an architecture that's relatively concise and based on actual business procedures (events) and entities.

However, ER modeling can still be vague around long-term database structure. It’s more of a way to show how tables can connect. It's useful for understanding database architecture at a higher level.

To make ER models more meaningful and scalable, data teams often take the foundational learnings from an ER model to form more complex data model architectures like dimensional modeling.

Good fit for:

  • Low-complexity data that connects neatly together
  • Simple, business-focused downstream use cases for the data
  • Central data teams that have deep knowledge of the facets of their data

Many data sources you ingest into your data warehouse via an ETL tool will have ERDs (entity relationship diagrams) that your team can review to better understand how the raw data connects. Slightly different from an ER model itself, ERDs are often used to represent ER models and their cardinality (ex., one-to-one, one-to-many) in a graphical format.

These ERDs will often look a little like the relational model shown earlier. Using these diagrams together with another data modeling technique, such as dimensional modeling, helps data teams efficiently wade through raw data and create business entities of meaning.

Data vault model

A data vault model abstracts entities, their relationship to each other, and their attributes into separate tables called hubs, links, and satellites:

  • Hubs: Contain the unique key of the entity they represent, as well as a hashed representation (as a surrogate key). Some example hubs for a typical DTC shop would potentially be h_customer, h_product, and h_payment.
  • Links: Contains the foreign keys to other hubs. In our example shop above, an h_customer hub would likely have a link to an h_payment hub by having a link containing the customer id and a payment id.
  • Satellites: Contain the attributes or descriptors of an entity. For a h_customer hub, a satellite could contain information like customer shipping address, email, NPS, etc.
An image of a simplified overview of a data vault setup

Data vault architecture was invented to easily track data changes by having an insert-only mindset; whenever a row changes in a classical data model, a new row is either added or the existing row is modified. In a data vault world, data updates are represented only by new rows.

Like other data models that encourage staging and mart layers, data vault architectures also support using raw, business, and information vaults. The consistency in the structure of a data vault encourages data teams to build a layer of organization for their data; this “safety” layer allows some teams to build other types of data models on top of these foundational models.

Data vault modeling can feel repetitive or prescriptive, given the consistent structure of hubs, links, and satellites. With dbt, you can use the dbtvault package to speed up the development time of fundamental data vault models to focus on writing the SQL that matters to your business.

Good fit for:

  • Enterprise teams where the ability to audit data is a primary concern
  • Teams that need flexibility and who want to make large structural changes to their data without causing delays in reporting
  • More technical data teams that can manage and govern the network-like growth of data vault models

The best data modeling technique is a consistent technique

At the end of the day, there’s no clear “right” or “wrong” data modeling technique for your data and business. However, there are ones that are probably more appropriate and aligned with the skillsets of your team. Before you commit to a data modeling technique, ask yourself (and your team members):

  • What are my primary end-use cases for the data?
  • Who needs to be able to query my final data or navigate through my data warehouse?
  • How many data sources am I dealing with? How complex is my data?
  • Which data modeling technique best suits the raw data formats and warehouse I work with?
  • What would governance and standardization look like with this modeling technique?

Like most analytics work, there’s a fine line between rigidity and flexibility with data modeling; you’ll likely never have a full warehouse of perfect dimensional data models that meet Kimball or Inmon’s standards because you adjust these techniques to your data and the needs of the business.

And while data modeling is more art than science at times, there's still a vital need for governance in this practice. Using a data transformation tool that supports any data modeling technique, peer review, integrated testing and documentation, and version control is the first step to governing and creating a consistent process for your data modeling.

Ready to learn more about how dbt can support your data modeling efforts? Take a look at some of the resources below to see how modern data teams are transforming (😉) the way they tackle data modeling with dbt:

Last modified on: Oct 16, 2024

Accelerate speed to insight
Democratize data responsibly
Build trust in data across business

Achieve a 194% ROI with dbt Cloud. Access the Total Economic Impact™️ study to learn how. Download now ›

Recent Posts