Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Identity column has null values

yit337
Contributor

I want to update a dimension table in the gold model from a silver table using create_auto_cdc_from_snapshot_flow with SCD2. In the target table I have defined an IDENTITY column, which should be populated automatically.

The DLT flow runs successfully, but all the values in the identity column are NULL. Why?!

1 ACCEPTED SOLUTION

Accepted Solutions

pradeep_singh
Contributor

Identity columns have certain limitations with SDP. Here is what the documentation says:

  • Identity columns are not supported with tables that are the target of AUTO CDC processing.

Because identity columns aren't supported on targets of AUTO CDC (create_auto_cdc_from_snapshot_flow / APPLY CHANGES), the engine won't generate values during the SCD2 updates, hence the NULLs in the identity column.

If you can, use a deterministic surrogate key (for example, a hash of the natural/business key) so SCD2 can stay stable across merges.
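A minimal sketch of such a deterministic key in plain Python (hashlib stands in for Spark's md5/concat_ws; the column values are hypothetical):

```python
import hashlib

def surrogate_key(business_key_parts):
    """Deterministic surrogate key: md5 over pipe-joined business keys.

    Mirrors SQL md5(concat_ws('|', ...)): the same business key always
    maps to the same 32-character hex key, across runs and merges.
    """
    joined = "|".join(str(p) for p in business_key_parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The same input always yields the same key, so SCD2 merges stay stable.
key_a = surrogate_key(["cust_001", "US"])
key_b = surrogate_key(["cust_001", "US"])
```

Because the key is derived from the data rather than from insert order, re-processing a snapshot never renumbers existing dimension rows.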

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

View solution in original post

2 REPLIES


SteveOstrowski
Databricks Employee

Hi @yit337,

The reason your identity column values are NULL is that the target table created by create_auto_cdc_from_snapshot_flow is a streaming table, and streaming tables do not support identity columns. This is a documented limitation:

https://docs.databricks.com/en/ldp/developer/ldp-sql-ref-create-streaming-table.html#limitations

The limitation states: "Generated columns, identity columns, and default columns are not supported" on streaming tables.

When you use create_auto_cdc_from_snapshot_flow (or any of the AUTO CDC / apply_changes APIs), the target must be a streaming table created with dlt.create_streaming_table(). Since identity columns are not supported on that table type, the column exists in the schema but never gets auto-populated, which results in NULLs.
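For reference, the usual shape of that pattern is roughly the following (a sketch only; the table and source names are placeholders, and the exact API surface may differ slightly by runtime version):

```python
import dlt

# The AUTO CDC target must be a streaming table -- which is exactly
# the table type that does not support identity columns.
dlt.create_streaming_table(
    name="gold_dimension",
    comment="SCD2 dimension; no identity column defined here",
)

dlt.create_auto_cdc_from_snapshot_flow(
    target="gold_dimension",
    source="silver_snapshot",   # placeholder source table
    keys=["business_key"],      # placeholder business key
    stored_as_scd_type=2,
)
```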

WORKAROUND OPTIONS

1. USE A SURROGATE KEY FUNCTION INSTEAD

Replace the identity column with a deterministic surrogate key. You can generate a unique key using one of these approaches:

-- Option A: hash-based key from your business keys
SELECT md5(concat_ws('|', key_col1, key_col2)) AS surrogate_key, ...

-- Option B: monotonically_increasing_id() in a downstream step
SELECT monotonically_increasing_id() AS surrogate_key, ...

You would compute this in your source query or in a downstream materialized view / Delta table that reads from the streaming table.

2. WRITE THE SCD2 RESULT TO A REGULAR DELTA TABLE AS A DOWNSTREAM STEP

Keep your create_auto_cdc_from_snapshot_flow pipeline as-is (without the identity column on the streaming table). Then create a separate Delta table with the identity column defined and populate it from the streaming table output:

CREATE TABLE gold.my_dimension (
  sk BIGINT GENERATED ALWAYS AS IDENTITY,
  business_key STRING,
  attribute1 STRING,
  -- other columns
  __START_AT TIMESTAMP,
  __END_AT TIMESTAMP
);

INSERT INTO gold.my_dimension (business_key, attribute1, __START_AT, __END_AT)
SELECT business_key, attribute1, __START_AT, __END_AT
FROM your_streaming_table_output;

Because this is a regular Delta table (not a streaming table), the identity column values will be auto-generated on insert.
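One thing to keep in mind: Delta identity values are guaranteed unique and increasing, but not necessarily consecutive (gaps can appear, for example when a transaction reserves values and then fails). A toy Python model of that contract:

```python
class ToyIdentityColumn:
    """Toy model of GENERATED ALWAYS AS IDENTITY semantics:
    values are unique and increasing, but gaps are allowed."""

    def __init__(self, start=1, step=1):
        self._next = start
        self._step = step

    def next_value(self, reserve=1):
        # The engine may reserve a batch of values per transaction; if
        # the transaction fails, the reserved values are skipped (a gap).
        value = self._next
        self._next += self._step * reserve
        return value

ids = ToyIdentityColumn()
first = ids.next_value()
_ = ids.next_value(reserve=10)   # e.g. a failed batch reserves ids
after_gap = ids.next_value()     # unique and larger, but not consecutive
```

So don't rely on the surrogate key being gap-free; rely only on uniqueness and ordering.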

3. USE ROW_NUMBER() IN A MATERIALIZED VIEW

If you need a sequential surrogate key for your star schema dimension and want it managed within the pipeline, you can create a materialized view downstream:

CREATE OR REFRESH MATERIALIZED VIEW gold.my_dimension AS
SELECT
  row_number() OVER (ORDER BY business_key, __START_AT) AS sk,
  *
FROM your_streaming_table;

Note that row_number() values may shift if the underlying data changes, so this works best when the data is stable. For a truly stable surrogate key, the hash-based approach (Option 1) or the separate Delta table with an identity column (Option 2) is more reliable.
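That shifting behavior is easy to demonstrate in plain Python with hypothetical rows: a key here is just a row's position in sorted order, so inserting a row that sorts earlier renumbers everything after it:

```python
def assign_keys(rows):
    """Mimics row_number() OVER (ORDER BY business_key): each key is
    the row's position in sorted order, so it depends on the whole set."""
    return {bk: i + 1 for i, bk in enumerate(sorted(rows))}

before = assign_keys(["banana", "cherry"])
after = assign_keys(["apple", "banana", "cherry"])  # "apple" sorts first

# "banana" was key 1; after the insert it silently becomes key 2,
# breaking any fact rows that already joined on sk = 1.
```

This is exactly why the hash-based key (Option 1) is safer for a dimension that fact tables join against.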

SUMMARY

The core issue is that identity columns and streaming tables are incompatible. Since the AUTO CDC APIs require a streaming table target, the identity column silently produces NULLs. The recommended path is to either use a hash-based surrogate key directly in the pipeline, or add a downstream regular Delta table where the identity column can function as expected.

For more context, you may also find your related question about streaming tables and Gold layer star schemas relevant:
https://community.databricks.com/t5/data-engineering/are-streaming-tables-suitable-for-gold-layer-st...

Documentation references:
- Streaming table limitations: https://docs.databricks.com/en/ldp/developer/ldp-sql-ref-create-streaming-table.html#limitations
- Identity columns: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
- AUTO CDC from snapshot: https://docs.databricks.com/en/ldp/cdc.html

* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.