dbt
Blog Guide to surrogate keys

Guide to surrogate keys

Jul 01, 2024

Learn

A surrogate key is a unique identifier derived from the data itself. It often takes the form of a hashed value of multiple columns that will create a uniqueness constraint for each row. You will need to create a surrogate key for every table that doesn't have a natural primary key.

Why would you ever need to make a surrogate key? Shouldn’t all tables innately just have a field that uniquely identifies each row? Now that would be too easy…

Let’s say you have a table with all license plate numbers and the state of the plate. While license plate numbers are unique to their state, there could be duplicate license plate numbers across different states. So by default, there’s no natural key that can uniquely identify each row here. In order to uniquely identify each record in this table, you could create a surrogate key based on the unique combination of license plate number and its state.

Surrogate keys, natural keys, and primary keys oh my!

Primary keys can be established two ways: naturally or derived through the data in a surrogate key.

  • A natural key is a primary key that is innate to the data. Perhaps in some tables there’s a unique id field in each table that would act as the natural key. You can use documentation like entity relationship diagrams (ERDs) to help understand natural keys in APIs or backend application database tables.
  • A surrogate key is a hashed value of multiple fields in a dataset that create a uniqueness constraint on that dataset. You’ll essentially need to make a surrogate key in every table that lacks a natural key.

Note - You may also hear about primary keys being a form of a constraint on a database object. Column constraints are specified in the DDL to create or alter a database object. For data warehouses that support the enforcement of primary key constraints, this means that an error would be raised if a field's uniqueness or non-nullness was broken upon an INSERT or UPDATE statement. Most modern data warehouses don’t support and enforce primary key constraints, so it’s important to have automated testing in-place to ensure your primary keys are unique and not null.

How surrogate keys are created

In analytics engineering, you can generate surrogate keys using a hashing method of your choice. Remember, in order to truly create a uniqueness constraint on a database object, you’ll need to hash the fields together that make each row unique; when you generate a correct surrogate key for a dataset, you’re really establishing the true grain of that dataset.

Let’s take this to an example. Below, there is a table you pull from an ad platform that collects calendar_date, ad_id, and some performance columns.

    • calendar_date

      ad_id

    • impressions

      spend

    • clicks

      conversions

    • 2022-05-16

      212

      2022-05-16

      214

      2022-05-05

      212

    • 88744

      4523.00

      323

      6.49

      125600

      117244.56

    • 9432

      166

      4

      0

      17318

      56

    In this state, this table has no natural key that can act as a primary key. You know the grain of this table: this is showing performance for each ad_id per calendar_date. Therefore, hashing those two fields will create a uniqueness constraint on this table.

    To create a surrogate key for this table using the MD5 function, run the following:

    select 
    	md5(calendar_date || ad_id) as unique_id,
    	*
    from {{ source('ad_platform', 'custom_daily_report')}}

    After executing this, the table would now have the unique_id field now uniquely identifying each row.

      • unique_id

        calendar_date

      • ad_id

        impressions

      • spend

        clicks

      • conversions

      • 62aef884fbe3470ce7d9a92140b09b17

        2022-05-16

        ea385f7a5e560ef4d8a78f7d913927e4

        2022-05-16

        53a33f257d1d4f2446469ac5adad1c0c

        2022-05-05

      • 212

        88744

        214

        323

        212

        125600

      • 4523.00

        9432

        6.49

        4

        117244.56

        17318

      • 166

        0

        56

      Testing surrogate keys

      Amazing, you just made a surrogate key! You can just move on to the next data model, right? No!! It’s critically important to test your surrogate keys for uniqueness and non-null values to ensure that the correct fields were chosen to create the surrogate key.

      In order to test for null and unique values you can utilize code-based data tests like dbt tests, that can check fields for nullness and uniqueness. You can additionally utilize simple SQL queries or unit tests to check if surrogate key count and non-nullness is correct.

      A note on hashing algorithms

      Depending on your data warehouse, there’s several cryptographic hashing options to create surrogate keys. The primary hashing methods include MD5 or other algorithms, like HASH or SHA. Choosing the appropriate hashing function is dependent on your dataset and what your warehouse supports.

        • Hashing algorithm

          Bit length

        • Known collisions?

        • HASH

          64 bits

          MD5

          128 bits

          SHA256

          256 bits

        • Yes, past ~4 billion elements

          No

        Note - A collision occurs when two pieces of data that are different end up hashing to the same value. If a collision occurs, a different hashing method should be used.

        Why we like surrogate keys

        Let’s keep it brief: surrogate keys allow data folks to quickly understand the grain of the database object and are compatible across many different data warehouses.

        Readability

        Because surrogate keys are comprised of the fields that make a uniqueness constraint on the data, you can quickly identify the grain of the data. For example, if you see in your data model that the surrogate key field is created by hashing the ad_id and calendar_date fields, you can immediately know the true grain of the data. When you clearly understand the grain of a database object, this can make for an easier understanding of how entities join together and fan out.

        Compatibility

        Making a surrogate key involves a relatively straightforward usage of SQL: maybe some coalescing, concatenation, and a hashing method. Most, if not all, modern data warehouses support both the ability to concat, coalesce, and hash fields. They may not have the exact same syntax or hashing functions available, but their core functionality is the same.

        Performance concerns for surrogate keys

        In the past, you may have seen surrogate keys take the form of monotonically increasing integers (ex. 1, 2, 3, 4). These surrogate keys were often limited to 4-bit integers that could be indexed quickly. However, in the practice of analytics engineering, surrogate keys derived from the data often take the form of a hashed string value. Given this form, these surrogate keys are not necessarily optimized for performance for large table scans and complex joins. For large data models (millions, billions, trillions of rows) that have surrogate keys, you should materialize them as tables or incremental models to help make joining entities more efficient.

        Conclusion

        Surrogate keys are unique row identifiers that are created by using columns in a database object to create a uniqueness constraint on the data. To create a surrogate key, you will use a cryptographic algorithm usually in the form of the MD5 function to hash together fields that create a uniqueness constraint on the dataset. Ultimately, surrogate keys are a great way to create unique row identifiers for database objects that lack them naturally and allow folks to easily identify the grain of the data.

        Further reading

        Want to learn more about keys, dbt, and everything in-between? Check out the following:

        Last modified on: Nov 27, 2024

        Build trust in data
        Deliver data faster
        Optimize platform costs

        Set your organization up for success. Read the business case guide to accelerate time to value with dbt Cloud.

        Read now ›

        Recent Posts