โ10-22-2024 11:01 AM
I am working on a Delta Live Tables (DLT) pipeline with Unity Catalog, where we are reading data from Azure Data Lake Storage (ADLS) and creating a table in the silver layer with Slowly Changing Dimensions (SCD) Type 1 enabled. In addition, we are adding four audit columns:
One of the main challenges I'm facing is ensuring that the __CreatedDateTime remains unchanged when existing records are updated. My approach is to:
While implementing this, I encountered the following warnings and errors:
Warning:
"Your query 'employee' reads from '<catalog name>.<schema name>.employee' but must read from 'LIVE.employee' instead. Always use the LIVE keyword when referencing tables from the same pipeline so that DLT can track the dependencies in the pipeline."
Error:
"The downstream table 'employee' is referenced when creating the upstream table or view 'employee_stg'. Circular dependencies are not supported in a DLT pipeline. Please remove the dependency between 'employee_stg' and 'employee'."
These issues seem to arise due to a circular dependency in the pipeline when trying to compare the new records with the existing table.
My Code Looks like somethings -
# Cloud file options for CSV files
cloud_file_options_csv = {
"cloudFiles.format": "csv",
"header": "true"
}
# Define the path to your files in ADLS
file_path = "path_to_adls_location/*"
@dlt.view(name="employee_stg")
def stg_employee():
df = (
spark.readStream
.format("cloudFiles")
.options(**cloud_file_options_csv)
.load(file_path)
)
from datetime import datetime
use_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
audit_handler = AuditHandler(use_time)
final_df = audit_handler.addAuditColumns(df, ["ID"])
if spark._jsparkSession.catalog().tableExists("catalog_name.schema_name.silver_employee"):
existing_df = spark.sql("SELECT * FROM catalog_name.schema_name.silver_employee")
# or existing_df = spark.read.table("catalog_name.schema_name.silver_employee")
# or existing_df = spark.readStream.table("catalog_name.schema_name.silver_employee")
# I tried with all above approaches also
final_df = update_created_datetime(existing_df, final_df, ["ID"])
return final_df
dlt.create_streaming_table(
name="silver_employee",
comment="Silver table for employee data with SCD Type 1."
)
dlt.apply_changes(
target="silver_employee",
source="employee_stg",
stored_as_scd_type=1,
keys=["ID"],
sequence_by="DATE"
)
Error -
โ10-31-2024 11:59 AM
Hi @yvishal519
employee_stg
view should not directly reference the silver_employee
table. Instead, the apply_changes
function will handle the merging logic.LIVE
Keyword: Ensure that the source table in apply_changes
is referenced with the LIVE
keyword.__CreatedDateTime
and __UpdatedDateTime
columns are managed within the apply_changes
function to ensure correct handling of timestamps.This approach should help you avoid the circular dependency issue and correctly manage the audit columns in your DLT pipeline.
Docs - https://docs.databricks.com/en/delta-live-tables/cdc.html
Thanks!
โ01-02-2025 01:17 PM
I don't see where or how the __CreatedDateTime and __UpdateDateTime columns are managed within the apply_changes function. I understand __Start_At and __End_At when Type 2 is used but am not seeing what you imply. Would you expand on this in your reply?
Audit Columns: The __CreatedDateTime and __UpdatedDateTime columns are managed within the apply_changes function to ensure correct handling of timestamps.
โ01-06-2025 09:57 AM
Hi @NandiniN - I did not mention you in the above reply though had meant to. I would appreciate if you had ability to provide a follow up.
"I don't see where or how the __CreatedDateTime and __UpdateDateTime columns are managed within the apply_changes function. I understand __Start_At and __End_At when Type 2 is used but am not seeing what you imply. Would you expand on this in your reply?
Audit Columns: The __CreatedDateTime and __UpdatedDateTime columns are managed within the apply_changes function to ensure correct handling of timestamps."
โ01-08-2025 03:14 PM
Hi @RBlum ,
The original author of this ticket had mentioned.
n addition, we are adding four audit columns:
โ01-13-2025 10:32 AM
Hi @yvishal519
Did you end up finding that the " __CreatedDateTime and __UpdatedDateTime columns are managed as expected within the apply_changes function to ensure correct handling of timestamps." as @NandiniN had replied originally OR are you managing the values in your own function outside of apply_changes? Either way can you provide some details in examples?
This has been confusing to me as I find that the lack of Create Date Time and Update Date Time within the apply_changes.
a month ago
Hi @RBlum , what's your use case? Are you not able to find __CreatedDateTime and __UpdatedDateTime columns and that's what your concern is, or some other challenge?
a month ago
When a SCD Type 1 is processed using the APPLY CHANGES INTO (or APPLY FROM SNAPSHOT) APIs there are not columns like the "__CreatedDateTime" and "__UpdatedDateTime" automatically created in the structure and maintained through the APPLY.
That is unlike the SCD Type 2 where "__START_AT" and "__END_AT" columns are automatically created in the table structure and maintained through the APPLY.
The use case for having the SCD Type 1 including the something like an automated __START_AT (original creation of the record .. which does not change) and __UPDATE_AT (from latest update of the record) is that it would allow simple identification of what records were first created and last changed which can be utilized downstream or to summarize the activity within the object since previously run.
4 weeks ago
I havenโt found an ideal solution for handling audit columns effectively in Databricks Delta Live Tables (DLT) when implementing SCD Type 1. It seems thereโs no straightforward way to incorporate these columns into the apply_changes function for this scenario.
That said, I completely agree with the previous response from @RBlum this scenario is indeed feasible when implementing SCD Type 2 with Databricks DLT.
If anyone has insights or workarounds for handling audit columns in SCD1, Iโd love to hear your thoughts!
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group