Identity column for streaming target tables using declarative pipeline

Andolina
New Contributor III

Hi All,

I have few bronze streaming tables from which I want to load into silver target tables based on some joins. I am doing this by creating a streaming table and using a view to retrieve fields from each bronze table. After that I am using AUTO-CDC to perform CDC logic. I need to populate Identity columns in our silver tables which will be our surrogate keys for gold layer.

Its mentioned in the document that identity columns cannot be used with AUTO-CDC target:https://learn.microsoft.com/en-us/azure/databricks/ldp/limitations

Is there any way to populate Identity column for such target table? Any guidance would be appreciated!

Thanks,

Andolina

cernst
New Contributor III

Wouldn't you want to use the natural key from the source? Typically auto cdc use cases (be they Snowflake, Big Query, Fabric, Databricks, or other) are those which are cloning an external source using its source ID, and then when subsequent information is desired, a downstream transformation takes place. In situations where that's not sufficient, then typically it's some form of structured streaming with a message queue relay in between or some other path. It sounds like your use case is just mis-assigning transformation layers.

aleksandra_ch
Databricks Employee
Databricks Employee

Hi @Andolina ,

Indeed, prefer using natural keys in your use case. Identity columns have limitations, it is better to use natural keys if available. 

Alternatively, you can generate surrogate key from the row values itself (with hash or concat function). 

If identity columns are strictly necessary, you'll have to generate them in the bronze table, and put it into theTRACK HISTORY ON in AUTO CDC so that changes on that column are not tracked.

Hope it helps

 

Andolina
New Contributor III

Thank you both for replying. We are indeed using the natural keys from source for auto-cdc keys. The Identity column is not used in auto-cdc but needs to be populated as an added column in my silver tables. Users need that to join multiple tables in gold layer. Now the problem I am facing is with auto-cdc I am not able to populate that directly in silver as there is no such option. 

However, I found an alternative way where we create the tables upfront in a separate script. Looks like I cannot do this in python as dp.create_streaming_table() didn't support in my case (may be there is some way here too), but I am able to achieve it using a .sql script. The .sql script needs to be part of the declarative pipeline though.

Example:

CREATE OR REPLACE STREAMING TABLE <tablename>
(
key_column BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1),
..
)

The issue I have now is I am not aware of any potential risk of doing this yet. Any insight would be helpful.
@aleksandra_ch - Thank you for the "Track History On" option. Let me check on that. I only populate that column while generating silver. Do you think I can set this option on the identity column?


Thanks,
Andolina

 

aleksandra_ch
Databricks Employee
Databricks Employee

@Andolina ,

Potential risks of identity columns are:

  • Concurrent writes are not allowed. That is, if multiple flows write to the table with generated identity columns, they will either run sequentially or fail on conflict.
  • There is a slight overhead when inserting values into the table with identity columns.

Yes you can put identity column into the TRACK HISTORY ON parameter.

Best regards,

View solution in original post