Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Surrogate Keys with Delta Live

bgerhardi
New Contributor III

We are considering moving to Delta Live Tables from a traditional SQL-based data warehouse.

What worries me is this FAQ on identity columns (Delta Live Tables frequently asked questions | Databricks on AWS). It seems to suggest that we basically can't create unique IDs for rows unless we are streaming, and of course an SCD 1 gold dimension table seems like it can never be a streaming table, as it is not append-only (it needs to merge).

Databricks talks up the requirement here: How to Identity Columns to Generate Surrogate Keys in the Databricks Lakehouse - The Databricks Blog. So it is clear how important this is.

What am I missing here?


Thanks for the reply

Yes, exactly, that is what concerns me. Ultimately, surrogate keys should be stable, and this statement suggests that they will not be, which seems to defeat the point of them.

Other data products (outside of the Delta Live pipeline) may be referring to CustomerKey 12 (CustomerCode: BOB1). If, due to a refresh (I'm not sure when this would be necessary, but it feels like it will happen), BOB1's key changes to 23 instead, we have just silently broken every external table that references our keys. That will not be a good day.

Also consider the SCD 2 case. SCD 2 tables benefit greatly from having a surrogate key from a meaningless identity column. However, if identity columns with APPLY CHANGES are not supported, and APPLY CHANGES is the way Databricks implements SCD 2, then how do Databricks suggest an SCD 2 table should look and be reliably referenced from other tables?

Do we need to maintain our key-to-business-code mappings in a separate, non-Delta-Live table to ensure they are stable? If this is the recommendation, then I suggest it would be good to add that information to the FAQ (a minimal example would also be great).
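To make the idea concrete, here is a minimal sketch of such a separate key map, assuming a Delta table with an identity column that lives outside the DLT pipeline. All table and column names here (dim.customer_key_map, staging.customers, CustomerKey, CustomerCode) are hypothetical, not an official Databricks recommendation:

```python
# Hypothetical stable key map maintained outside DLT. Assumes the table was
# created once with an identity column, e.g.:
#   CREATE TABLE dim.customer_key_map (
#     CustomerKey BIGINT GENERATED ALWAYS AS IDENTITY,
#     CustomerCode STRING
#   )
# The MERGE below is insert-only: codes already in the map never get a new
# key, so BOB1 keeps CustomerKey 12 across refreshes.
key_map_merge = """
MERGE INTO dim.customer_key_map AS t
USING (SELECT DISTINCT CustomerCode FROM staging.customers) AS s
ON t.CustomerCode = s.CustomerCode
WHEN NOT MATCHED THEN
  INSERT (CustomerCode) VALUES (s.CustomerCode)
"""
# On Databricks this would run as: spark.sql(key_map_merge)
```

Because the statement has no WHEN MATCHED clause, a refresh can only ever add new codes; existing surrogate keys are never rewritten.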

TomRenish
New Contributor III

Did you ever get an answer to this? It seems like a huge hole in the DLT feature set. My current DLT pipeline stops at the point where the output data is made distinct and REMOVEd records are dropped. From there, I'm using a normal Delta pipeline to scan the table twice:

  1. Create a table with the source data's proprietary alphanumeric IDs and an identity column (my new surrogate key, aka SK) as part of the table's DDL, then append new alpha IDs from the DLT output wherever the alpha ID doesn't already exist in this target table. This ensures the SK-to-alpha-ID relationship is immortalized.
  2. Create a table to warehouse all the data and join the table above to it so that the output has an SK. The problem with this is that it's not a read_stream() operation, so I'm literally doing a full overwrite. Very, very suboptimal...
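The two steps above can be sketched roughly as follows. This is only an illustration of the pattern: the table and column names (sk_map, dlt_output, warehouse_table, alpha_id, sk) are made up, and it assumes sk_map was created with an identity column on sk:

```python
# Step 1: append only alpha IDs that are not yet in the SK table, so
# existing surrogate keys never change (insert-only via anti-join).
step1_sql = """
INSERT INTO sk_map (alpha_id)
SELECT DISTINCT s.alpha_id
FROM dlt_output AS s
LEFT ANTI JOIN sk_map AS m ON s.alpha_id = m.alpha_id
"""

# Step 2: full overwrite of the warehouse table with the SK joined in --
# the non-streaming rewrite described above.
step2_sql = """
INSERT OVERWRITE warehouse_table
SELECT m.sk, s.*
FROM dlt_output AS s
JOIN sk_map AS m ON s.alpha_id = m.alpha_id
"""
# On Databricks: spark.sql(step1_sql); spark.sql(step2_sql)
```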

Did you come up with a better solution for this problem?

The other challenge in this is that the output tables have some dimension data that doesn't natively have IDs. I have been forced to create my own dimension tables with their own SKs, then retroactively write the dimension's SK back into the target table. I don't mind doing this, but again the DLT seems ill-equipped to support this.

TomRenish
New Contributor III

It's amazing how therapeutic it can be to write out your problems. After replying here, the solution came to me. The secret sauce is in getting everything done *before* you run the dlt.apply_changes() engine. After that, all bets are off because the engine seemingly stops worrying about tracking CDC. So before you run apply changes...

Make a simple table that takes in only your source data's primary key, or make one via concats as necessary. Then do something like this:

#########################################################
# define the target table's IDs

import dlt

@dlt.table(
    comment="Target for CDC ingestion.",
    schema="""
        sk_workorder BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        workorder_id STRING
    """,
)
def stg_workorder_ids():
    return (
        dlt.read_stream("stg_workorders_hist_bronze")
        .select("workorder_id")
        .distinct()
    )

#########################################################

This gets you a 1:1 relationship between your proprietary key and the SK (surrogate key). Once I built this, I joined it back to my history_silver table, *then* passed that through the apply_changes engine. Seems to work. GL!
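Roughly, the wiring described here might look like the sketch below. This is only my reading of the approach, not a verified pipeline: it assumes the DLT runtime, an `event_ts` ordering column, and table names (other than stg_workorder_ids) that are guesses:

```python
def define_pipeline():
    """Sketch: join the stable SK map in *before* apply_changes.

    Only meaningful inside a Delta Live Tables pipeline. The 'event_ts'
    column and all names except stg_workorder_ids are assumptions.
    """
    import dlt  # available only in the DLT runtime
    from pyspark.sql import functions as F

    @dlt.view
    def workorders_keyed():
        # Attach the surrogate key to the change feed before CDC processing.
        return (
            dlt.read_stream("stg_workorders_hist_bronze")
            .join(dlt.read("stg_workorder_ids"), "workorder_id")
        )

    dlt.create_streaming_table("dim_workorders")
    dlt.apply_changes(
        target="dim_workorders",
        source="workorders_keyed",
        keys=["workorder_id"],
        sequence_by=F.col("event_ts"),  # assumed ordering column
        stored_as_scd_type=2,
    )
```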

@TomRenish I tried to implement your solution; however, I get an error message when running the code you provided.

The error message says that the declared schema

(sk_workorder, workorder_id)

does not match the inferred schema

(workorder_id).

I have not found a way to work around this.

Am I missing something?

Kearon
New Contributor III

@Brett Gerhardi Did you get a satisfactory solution for this? We are also looking to migrate our warehousing to Databricks, but SCD 2 without effective surrogate keys is... less than desirable.

@Tom Renish Thank you for offering this. Have you found this works reliably? It just seems strange that such a workaround is necessary. Surely surrogate keys for SCD 2 are essential? Or do Databricks expect us to use inefficient, convoluted joins with combinations of natural keys and dates?

bgerhardi
New Contributor III

Frankly, the lack of response from Databricks on this is disappointing.

Having no official guidance from Databricks, for their flagship premium product, on such a fundamental and basic data warehouse concept is concerning.

TomRenish
New Contributor III

Good morning @Kearon McNicol. Yes, this is working reliably. Now let me tell you why I don't like the solution. First though, some caveats...

  1. The data I'm working with is customer-service oriented. The implication is that a given record's field values are prone to change. Simpler data, such as logs, has no such concern.
  2. I don't consider myself to be an expert. It's plausible that my solution is dumb

Having said that, I looked more closely at what the DLT pipeline is doing and what kinds of objects are being created. When the join between the table full of IDs and the latest data takes place, the code quite explicitly calls for a table with @dlt.table(). In the graphical view, however, the output object is in fact a materialized view. When the materialized view is being written, it is *not* operating in append mode! It is instead doing a complete rewrite. My data is mercifully small today, with the largest table coming in at 2.3M records, but that's going to change. The full rewrite has me deeply concerned that scalability will become a problem. One solution that occurs to me is to use a Spark pipeline instead and rely on keys to do MERGE operations.
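That MERGE-based alternative could be sketched with the Delta Lake Python API along these lines. It is a sketch under assumptions, not a tested implementation: it presumes a Databricks runtime (where delta-spark is available), and the table names and join key are illustrative:

```python
def merge_final_table(spark):
    """Upsert the latest silver rows into the gold table by surrogate key,
    instead of rewriting the whole materialized view.

    Sketch only: 'gold.workorders', 'silver.workorders_latest' and the
    'sk_workorder' join key are hypothetical names.
    """
    from delta.tables import DeltaTable  # ships with Databricks runtimes

    gold = DeltaTable.forName(spark, "gold.workorders")
    latest = spark.table("silver.workorders_latest")
    (
        gold.alias("t")
        .merge(latest.alias("s"), "t.sk_workorder = s.sk_workorder")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

Because the MERGE touches only matched or new rows, it avoids the full rewrite that the materialized view performs on each run.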

I am of course open to suggestions on this topic.

Kearon
New Contributor III

Thank you. That's helpful.

One of the points of considering Databricks was to make use of the elegance of the automated processes (schema detection, STORED AS SCD 2, etc.). This could make for a clean code base and a more robust data flow.

However, at this rate, it looks like I would have to use traditional methodologies, so then the question arises - why bother with Databricks?

I'm investigating possibilities for post SCD processing but, so far, it's proving somewhat torturous...

TomRenish
New Contributor III

I forgot something very important. When that final materialized view is being built, it's based on a dlt.read() rather than dlt.read_stream(). For my purposes this is necessary because left joins are involved, to bring parent-child table keys together where there aren't always children. When doing a left join, the DLT engine will complain when attempting to use read_stream. I suspect it can work if using purely inner joins to complete the table.

TomRenish
New Contributor III

I added a comment above about left joins, please take a look. Re: your question about moving back to traditional methods, I may end up doing a mix where dlt is in play for running everything but the final table in real time, and create a different pipeline in Spark to update final output tables where I can have flexibility with Merge. It's not a pretty solution, but it may check all the boxes where scalability is concerned.

Anonymous
Not applicable

Hi @Brett Gerhardi

Hope all is well! Just wanted to check in on whether you were able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Pelle123
New Contributor II

I am wondering if there is an answer to this question now.
