11-01-2022 01:26 PM
We are considering moving to Delta Live tables from a traditional sql-based data warehouse.
Worrying me is this FAQ on identity columns Delta Live Tables frequently asked questions | Databricks on AWS this seems to suggest that we basically can't create unique ids for rows unless streaming and of course a SCD 1 dimension gold table seems like it will never be able to be a streaming table as it is not an append-only (it needs to merge)
Databricks talks up the requirement here .. How to Identity Columns to Generate Surrogate Keys in the Databricks Lakehouse - The Databricks Blog so it is clear how important this is.
What am I missing here?
11-17-2022 03:44 AM
Thanks for the reply
Yes exactly that is concerning me. Ultimately surrogate keys should be stable and this statement suggests that they will not be which seems to defeat the point of them.
Other data products (outside of the delta live pipeline) may be referring to CustomerKey 12 (CustomerCode: BOB1). If, due to a refresh (not sure when this would be necessary, but feels like it will happen) BOB1's key changes to 23 instead, we have just silently broken all external tables with references to our keys. This will not be a good day.
Also consider the SCD2 case. SCD2 tables increasingly benefit from having a Surrogate Key from a meaningless identity column. However if identity with APPLY CHANGES is not supported and APPLY CHANGES is the way that databricks implements SCD2 then how do Databricks suggest a SCD2 table will look and be reliably referenced from other tables?
Do we need to maintain our Key to BusinessCode mappings in a seperate, non-delta-live table to ensure they are stable? If this is the recommendation then I suggest it would be good to add that information to the FAQ (a minimal example would also be great)
01-12-2023 08:47 AM
Did you ever get an answer to this? It seems like a huge hole in the DLT feature set. My current DLT pipeline stops at the point where the data being outputted is distinct and REMOVEd records dropped. From there, I'm using a normal Delta pipeline to scan the table twice
Did you come up with a better solution for this problem?
The other challenge in this is that the output tables have some dimension data that doesn't natively have IDs. I have been forced to create my own dimension tables with their own SKs, then retroactively write the dimension's SK back into the target table. I don't mind doing this, but again the DLT seems ill-equipped to support this.
01-12-2023 07:48 PM
It's amazing how therapeutic it can be to write out your problems. After replying here, the solution came to me. The secret sauce is in getting everything done *before* you run the dlt.apply_changes() engine. After that, all bets are off because the engine seemingly stops worrying about tracking CDC. So before you run apply changes...
make a simple table that takes in only your source data's primary key, or make one via concats as necessary. Then do something like this:
#########################################################
##define the target table's IDs
@dlt.table(comment = "Target for CDC ingestion.",
schema="""
sk_workorder BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
workorder_id STRING
"""
)
def stg_workorder_ids():
return (dlt.read_stream("stg_workorders_hist_bronze")
.select('workorder_id')
.distinct()
)
#########################################################
This gets you a 1:1 relationship between your proprietary key and the SK (surrogate key). Once I built this, I joined it back to my history_silver table, *then* passed that through the apply_changes engine. Seems to work. GL!
07-02-2024 06:59 PM
@TomRenish I tried to implement your solution however, I get an error message when running the code you provide.
There error message says that declared schema:
(sk_workorder, workorder_id)
does not match the inferred schema:
(workorder_id).
I have not found a way to work around this.
Am I missing something?
02-09-2023 03:01 AM
@Brett Gerhardi Did you get a satisfactory solution for this? We are also looking to migrate our warehousing to Databricks, but SCD 2 without effective surrogate keys is.... less than desirable.
@Tom Renish Thank you for offering this. Have you found this works reliably? It just seems strange that such a work around is necessary. Surely surrogate keys for SCD 2 are essential? Or are Databricks expecting us to use inefficient, convoluted joins with combinations of Natural Keys and dates?
02-09-2023 04:51 AM
Frankly the lack of response from DataBricks on this is dissappointing.
Having no official guidance from DataBricks for their flagship premium product on such a fundamental and basic data warehouse concept is concerning.
02-09-2023 07:40 AM
good morning @Kearon McNicol . Yes, this is working reliably. Now let me tell you why I don't like the solution. First though, some caveats...
Having said that, I looked more closely at what the dlt pipeline is doing and what kinds of objects are being created. When the join between the table full of IDs and the latest data takes place, the code quite explicitly calls for a table with "@dlt.table()". In the graphical view however, the outputted object is in fact a materialized view. When the materialized view is being written, it is *not* operating in an append mode! It is instead doing a complete rewrite. My data is mercifully small today with the largest table coming in at 2.3m records, but that's going to change. The full rewrite has me deeply concerned that scalability will become a problem. One solution that occurs to me is to use a spark pipeline instead and rely on keys to do MERGE operations instead.
I am of course open to suggestions on this topic.
02-09-2023 07:46 AM
Thank you. That's helpful.
One of the points of considering Databricks was to make use the elegance of the automated processes (schema detection), STORED AS SCD 2, etc. This could make for a clean code base and a more robust data flow.
However, at this rate, it looks like I would have to use traditional methodologies, so then the question arises - why bother with Databricks?
I'm investigating possibilities for post SCD processing but, so far, it's proving somewhat torturous...
02-09-2023 08:10 AM
I forgot something very important. When that final materialized view is being built, it's based on a dlt.read() rather than dlt.read_stream. For my purposes this is necessary because left joins are involved to brink parent-child table keys together where there aren't always children. When doing a left join, the dlt engine will complain when attempting to use read_stream. I suspect that it can work if using purely inner joins to complete the table.
02-09-2023 08:20 AM
I added a comment above about left joins, please take a look. Re: your question about moving back to traditional methods, I may end up doing a mix where dlt is in play for running everything but the final table in real time, and create a different pipeline in Spark to update final output tables where I can have flexibility with Merge. It's not a pretty solution, but it may check all the boxes where scalability is concerned.
01-15-2023 11:08 PM
Hi @Brett Gerhardi
Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.
We'd love to hear from you.
Thanks!
09-05-2024 07:24 AM
I am wondering if there is an answer to this question now.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group