cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Surrogate Keys with Delta Live

bgerhardi
New Contributor III

We are considering moving to Delta Live tables from a traditional sql-based data warehouse.

Worrying me is this FAQ on identity columns Delta Live Tables frequently asked questions | Databricks on AWS this seems to suggest that we basically can't create unique ids for rows unless streaming and of course a SCD 1 dimension gold table seems like it will never be able to be a streaming table as it is not an append-only (it needs to merge)

Databricks talks up the requirement here .. How to Identity Columns to Generate Surrogate Keys in the Databricks Lakehouse - The Databricks Blog so it is clear how important this is. 

What am I missing here?

12 REPLIES 12

Kaniz_Fatma
Community Manager
Community Manager

Hi @Brett Gerhardi​ , Can you explain what exactly is concerning you?

Is this the problem, or are you looking for a specific solution?

Identity columns can be specified in the schema definition of a CREATE LIVE TABLE statement in the same way as the standard CREATE TABLE syntax. However, for live tables that are not defined as STREAMING, each refresh may completely regenerate the table, resulting in a new assignment of identity values to each row. Identity columns are not supported with APPLY CHANGES tables.

Thanks for the reply

Yes exactly that is concerning me. Ultimately surrogate keys should be stable and this statement suggests that they will not be which seems to defeat the point of them.

Other data products (outside of the delta live pipeline) may be referring to CustomerKey 12 (CustomerCode: BOB1). If, due to a refresh (not sure when this would be necessary, but feels like it will happen) BOB1's key changes to 23 instead, we have just silently broken all external tables with references to our keys. This will not be a good day.

Also consider the SCD2 case. SCD2 tables increasingly benefit from having a Surrogate Key from a meaningless identity column. However if identity with APPLY CHANGES is not supported and APPLY CHANGES is the way that databricks implements SCD2 then how do Databricks suggest a SCD2 table will look and be reliably referenced from other tables?

Do we need to maintain our Key to BusinessCode mappings in a seperate, non-delta-live table to ensure they are stable? If this is the recommendation then I suggest it would be good to add that information to the FAQ (a minimal example would also be great)

TomRenish
New Contributor III

Did you ever get an answer to this? It seems like a huge hole in the DLT feature set. My current DLT pipeline stops at the point where the data being outputted is distinct and REMOVEd records dropped. From there, I'm using a normal Delta pipeline to scan the table twice

  1. created a table with the source data's proprietary alpha-numeric IDs and an identity column (my new surrogate key, aka SK) as part of the table's DDL, where I append in new alpha IDs from the DLT where the alpha ID doesn't already exist in this target table. This ensures the sk to alpha ID relationship is immortalized
  2. created a table to warehouse all the data and joined the table above to it so that the output has a SK. The problem with this is that it's not a read_stream() operation so I'm literally doing a full overwrite. Very, very suboptimal...

Did you come up with a better solution for this problem?

The other challenge in this is that the output tables have some dimension data that doesn't natively have IDs. I have been forced to create my own dimension tables with their own SKs, then retroactively write the dimension's SK back into the target table. I don't mind doing this, but again the DLT seems ill-equipped to support this.

TomRenish
New Contributor III

It's amazing how therapeutic it can be to write out your problems. After replying here, the solution came to me. The secret sauce is in getting everything done *before* you run the dlt.apply_changes() engine. After that, all bets are off because the engine seemingly stops worrying about tracking CDC. So before you run apply changes...

make a simple table that takes in only your source data's primary key, or make one via concats as necessary. Then do something like this:

#########################################################

##define the target table's IDs

@dlt.table(comment = "Target for CDC ingestion.",

      schema="""

         sk_workorder BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),

         workorder_id STRING

         """

     )

def stg_workorder_ids():

 return (dlt.read_stream("stg_workorders_hist_bronze")

     .select('workorder_id')

     .distinct()

     )

#########################################################

This gets you a 1:1 relationship between your proprietary key and the SK (surrogate key). Once I built this, I joined it back to my history_silver table, *then* passed that through the apply_changes engine. Seems to work. GL!

@TomRenish  I tried to implement your solution however, I get an error message when running the code you provide.

There error message says that declared schema:

(sk_workorder, workorder_id)

 

does not match the inferred schema:

(workorder_id).

I have not found a way to work around this.

Am I missing something?

Kearon
New Contributor III

@Brett Gerhardi​ Did you get a satisfactory solution for this? We are also looking to migrate our warehousing to Databricks, but SCD 2 without effective surrogate keys is.... less than desirable.

@Tom Renish​ Thank you for offering this. Have you found this works reliably? It just seems strange that such a work around is necessary. Surely surrogate keys for SCD 2 are essential? Or are Databricks expecting us to use inefficient, convoluted joins with combinations of Natural Keys and dates?

bgerhardi
New Contributor III

Frankly the lack of response from DataBricks on this is dissappointing.

Having no official guidance from DataBricks for their flagship premium product on such a fundamental and basic data warehouse concept is concerning.

TomRenish
New Contributor III

good morning @Kearon McNicol​ . Yes, this is working reliably. Now let me tell you why I don't like the solution. First though, some caveats...

  1. The data I'm working with is customer-service oriented. The implication here is that a given record's field values are prone to change. Simpler data such as logs have no such concern
  2. I don't consider myself to be an expert. It's plausible that my solution is dumb

Having said that, I looked more closely at what the dlt pipeline is doing and what kinds of objects are being created. When the join between the table full of IDs and the latest data takes place, the code quite explicitly calls for a table with "@dlt.table()". In the graphical view however, the outputted object is in fact a materialized view. When the materialized view is being written, it is *not* operating in an append mode! It is instead doing a complete rewrite. My data is mercifully small today with the largest table coming in at 2.3m records, but that's going to change. The full rewrite has me deeply concerned that scalability will become a problem. One solution that occurs to me is to use a spark pipeline instead and rely on keys to do MERGE operations instead.

I am of course open to suggestions on this topic.

Kearon
New Contributor III

Thank you. That's helpful.

One of the points of considering Databricks was to make use the elegance of the automated processes (schema detection), STORED AS SCD 2, etc. This could make for a clean code base and a more robust data flow.

However, at this rate, it looks like I would have to use traditional methodologies, so then the question arises - why bother with Databricks?

I'm investigating possibilities for post SCD processing but, so far, it's proving somewhat torturous...

TomRenish
New Contributor III

I forgot something very important. When that final materialized view is being built, it's based on a dlt.read() rather than dlt.read_stream. For my purposes this is necessary because left joins are involved to brink parent-child table keys together where there aren't always children. When doing a left join, the dlt engine will complain when attempting to use read_stream. I suspect that it can work if using purely inner joins to complete the table.

TomRenish
New Contributor III

I added a comment above about left joins, please take a look. Re: your question about moving back to traditional methods, I may end up doing a mix where dlt is in play for running everything but the final table in real time, and create a different pipeline in Spark to update final output tables where I can have flexibility with Merge. It's not a pretty solution, but it may check all the boxes where scalability is concerned.

Anonymous
Not applicable

Hi @Brett Gerhardi​ 

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!