Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

GUID or concatenated string as a primary key in silver and gold data models

RobTScot
New Contributor

Hi

We are in the process of designing and building new silver and gold layers (Star Schema). We will be using Databricks, which is new to the organisation.

  • The silver layer will be modelled using classic 3NF, with SCD 7.
  • The gold layer is a star schema. We have some facts that will have 9+ dimensions that will require group by for calculating multiple measures.

There is a recommendation by a vendor to use a SHA2 hash (stored as VARCHAR(64)) as the primary key of all objects, as dbt is being used for the pipelines.

There is another proposal to use concatenated strings for the primary keys.

What are the considerations for choosing a primary key data type?

Are there any performance or cost considerations to using a GUID or concatenated string over an Integer data type? 

Thanks

Rob

3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @RobTScot ,

I like the hash key approach. The hash gives you a deterministic, fixed-length key. Simply concatenating a bunch of strings will work, but in general the resulting key is much wider and not a fixed size.

For the gold layer we just use an identity column. That's because Power BI is our reporting tool of choice, and using a string as a key is considered bad practice in Power BI (the performance would be really bad).

balajij8
Contributor

You can use identity columns in Databricks

SteveOstrowski
Databricks Employee

Hi @RobTScot,

This is a common design decision in lakehouse data modeling, and the right answer depends on the layer, the tooling, and the downstream consumers. Here is a breakdown of the key considerations.


SILVER LAYER (3NF WITH SCD 7)

For silver tables where you are doing SCD tracking, a hash-based surrogate key (like SHA2 over the business key columns) is a solid choice. The main advantages:

1. Deterministic: the same input always produces the same hash, which makes idempotent loads straightforward. You can re-derive the key from the source columns at any time.

2. Fixed length: SHA2-256 produces a consistent 64-character hex string regardless of input. A concatenated string key, by contrast, varies in width depending on the values, which can lead to inconsistencies and makes debugging harder.

3. dbt-friendly: dbt's dbt_utils.generate_surrogate_key() macro uses this pattern, so it integrates cleanly with your pipeline tooling.

4. Handles composite keys well: when your business key is multiple columns (which is common in 3NF), hashing them into a single column simplifies joins downstream.
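For reference, a dbt model using this pattern typically just calls the macro; the model and column names here are placeholders:

```sql
-- models/silver/stg_customer.sql (hypothetical model and columns)
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'source_system']) }} as customer_hk,
    customer_id,
    source_system,
    customer_name
from {{ ref('raw_customer') }}
```

Note that generate_surrogate_key hashes with MD5 by default; if you specifically need SHA2-256, an explicit sha2(concat_ws(...)) expression gives you full control over the algorithm.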

One practical note: if you go with SHA2, always hash the columns in a consistent, documented order, and handle NULLs explicitly (e.g., coalesce to a sentinel value before hashing). This avoids subtle bugs where inconsistent NULL handling produces different hashes for the same logical key.
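A sketch of that NULL handling in Databricks SQL (column names are hypothetical):

```sql
-- Coalesce each business key column to a sentinel before hashing,
-- and join with a delimiter so ('ab','c') and ('a','bc') hash differently.
SELECT
  sha2(
    concat_ws('||',
      coalesce(customer_id, '_NULL_'),
      coalesce(source_system, '_NULL_')
    ),
    256
  ) AS customer_hk,
  *
FROM silver.stg_customer;
```

The delimiter matters as much as the sentinel: without it, different column splits of the same characters would collide.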


GOLD LAYER (STAR SCHEMA)

For gold/star schema tables, especially fact tables with 9+ dimension keys, the calculus shifts. A few considerations:

1. Integer surrogate keys are significantly more compact. A BIGINT is 8 bytes vs. 64 bytes for a SHA2 VARCHAR(64). When you have 9+ dimension foreign keys on a fact table, that difference multiplies. Smaller rows mean fewer bytes scanned, which directly translates to lower cost and faster queries on Databricks SQL warehouses.

2. Join performance: integer-to-integer joins are faster than string-to-string joins, particularly at scale. Photon (the native vectorized engine in Databricks SQL) is highly optimized for integer operations.

3. BI tool compatibility: if you are using Power BI, Tableau, or similar tools, integer keys perform much better. Power BI in particular is well-known to have significant performance degradation with string-type keys in relationships.

4. Databricks supports identity columns for auto-incrementing surrogates:

CREATE TABLE gold.dim_customer (
  customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id STRING NOT NULL,
  customer_name STRING,
  ...
);

Identity columns use BIGINT and auto-assign unique, incrementing values. One limitation to be aware of: declaring an identity column disables concurrent transactions on that table, so it works best for batch-loaded dimension tables. If you need concurrency, use GENERATED BY DEFAULT AS IDENTITY instead and manage the sequence yourself for parallel loads.
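A minimal sketch of the BY DEFAULT variant (table and column names are assumptions):

```sql
CREATE TABLE gold.dim_product (
  product_sk BIGINT GENERATED BY DEFAULT AS IDENTITY,
  product_id STRING NOT NULL,
  product_name STRING
);

-- Loads may either omit product_sk (Databricks assigns a value)
-- or supply explicit values when managing keys externally:
INSERT INTO gold.dim_product (product_id, product_name)
VALUES ('P-001', 'Widget');
```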


PRIMARY KEY CONSTRAINTS IN DATABRICKS

Regardless of your key choice, you can declare PRIMARY KEY constraints on Unity Catalog tables:

ALTER TABLE gold.dim_customer
ADD CONSTRAINT dim_customer_pk PRIMARY KEY (customer_sk) NOT ENFORCED RELY;

These constraints are informational (not enforced by the engine), but the RELY option enables Photon to use them for query optimization, such as join elimination. This is especially valuable in star schemas where BI tools generate queries that join many dimensions. You should still validate uniqueness in your pipeline logic (dbt tests, quality expectations, etc.).
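A simple uniqueness check you can wire into a dbt test or a scheduled query (table and column taken from the earlier example):

```sql
-- Returns zero rows when customer_sk is unique; any output is a violation.
SELECT customer_sk, COUNT(*) AS cnt
FROM gold.dim_customer
GROUP BY customer_sk
HAVING COUNT(*) > 1;
```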

Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-alter-table-add-constraint.htm...


PERFORMANCE AND COST SUMMARY

Here is how the options compare across key dimensions:

Storage per key column:
- BIGINT identity: 8 bytes
- SHA2 VARCHAR(64): 64 bytes
- Concatenated string: variable, often 20-100+ bytes

Join performance:
- BIGINT: fastest (Photon-optimized integer comparisons)
- SHA2/concatenated string: slower (string comparison, larger memory footprint)

Determinism:
- BIGINT identity: not deterministic (assigned at insert time)
- SHA2: fully deterministic (re-derivable from source columns)
- Concatenated string: deterministic but variable-length

BI tool compatibility:
- BIGINT: excellent across all tools
- String keys: can degrade performance in Power BI, Tableau, etc.

dbt integration:
- SHA2: native support via generate_surrogate_key()
- BIGINT identity: requires additional orchestration for SCD tracking


RECOMMENDED APPROACH

A common pattern that works well on Databricks:

- Silver layer: use SHA2 hash keys as your surrogate/business key identifiers. This aligns with dbt conventions, supports idempotent loads, and handles composite keys cleanly.

- Gold layer: use BIGINT identity columns as surrogate keys on dimension tables, with foreign keys on fact tables referencing those integers. Keep the hash key as an alternate/natural key column on dimensions so you can map back to silver.

This gives you the best of both worlds: deterministic keys for pipeline reliability in silver, and compact integer keys for query performance and BI compatibility in gold.
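In DDL terms, the hybrid pattern looks roughly like this (names are illustrative):

```sql
CREATE TABLE gold.dim_customer (
  customer_sk BIGINT GENERATED ALWAYS AS IDENTITY, -- compact key for joins and BI
  customer_hk STRING NOT NULL,                     -- SHA2 hash key carried from silver
  customer_id STRING NOT NULL,                     -- business key
  customer_name STRING
);
```

Fact loads join incoming silver rows to the dimension on customer_hk and store only the integer customer_sk on the fact table.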

Reference for identity columns: https://docs.databricks.com/en/delta/generated-columns.html
Reference for constraints: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-constraint.html

* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.