- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-22-2025 02:42 PM
- Labels:
-
Workflows
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2025 07:23 AM - edited 10-23-2025 07:25 AM
Your solution #1 is very frustrating to me as well, for a number of reasons. Simply put, we have to be able to compare incoming data to target data for normal ETL operations.
One way around this is to create a view of your target silver table, outside of your pipeline (this part is key), and compare against that view. The pipeline will give you a warning telling you that you're being naughty, but it will let you do the comparison anyways. We have had to do this and it does work, although it's clearly not a desirable solution to have to maintain this set of views, but it would allow you to perform the deletion check.
Example: `create or replace view silver.vw_target_table as select * from silver.target_table;`
Someone else might be able to chime in with a more elegant solution, but that's what I've got for you at least!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-23-2025 10:47 AM
The issue you're experiencing is a common challenge in Delta Live Tables (DLT) when implementing mixed refresh patterns (weekly full refresh + daily incremental updates) with schema evolution. The "__enzyme__row__id" column and schema mismatch errors are indicators of DLT's internal tracking mechanisms conflicting with your conditional logic.
Root Cause Analysis
The core problem stems from DLT's expectation of consistent schema across pipeline runs. When you switch between full catch-up mode (with is_deleted flag logic) and normal incremental mode, you're essentially creating two different schemas:
-
Full catch-up mode: Includes additional columns and logic for deleted record detection
-
Normal mode: Standard upsert without deleted record tracking
The "__enzyme__row__id" column is DLT's internal row tracking mechanism , and schema mismatches occur when DLT expects consistent column structures between runs.
Recommended Solutions
Solution 1: Consistent Schema Approach (Recommended)
Always include the is_deleted column in your silver table schema, regardless of the refresh type:
@Dlt.table
def silver_table():
df = dlt.read("bronze_table")
# Always add is_deleted column, defaulting to False for normal runs
if not spark.conf.get("delete_missing_rows", "false").lower() == "true":
df = df.withColumn("is_deleted", lit(False))
else:
# Your existing full catch-up logic here
pass
return df
Solution 2: Separate Tables Strategy
Create separate silver tables for different refresh patterns:
-
silver_table_incremental: For daily updates -
silver_table_full: For weekly full refreshes with delete tracking
Then use a view or downstream table to union them appropriately.
Solution 3: Schema Evolution Configuration
Enable automatic schema evolution in your DLT pipeline configuration:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
However, use this cautiously as it can mask unintended schema changes.
Best Practices for Your Use Case
Weekly Full Refresh Strategy
-
Use DLT's built-in full refresh capability rather than custom logic
-
Schedule separate pipelines for full vs incremental refreshes
-
Leverage Change Data Feed (CDF) for deleted record tracking
Handling Deleted Records
Consider implementing a more robust deleted record detection strategy:
@Dlt.table
def silver_table_with_deletes():
current_data = dlt.read("bronze_latest")
if is_full_refresh_run():
# Compare with previous state
previous_data = dlt.read("bronze_previous")
deleted_records = detect_deleted_records(previous_data, current_data)
return current_data.union(deleted_records.withColumn("is_deleted", lit(True)))
else:
return current_data.withColumn("is_deleted", lit(False))
Implementation Recommendations
Option A: Redesign with Consistent Schema
-
Always include
is_deletedcolumn in silver table -
Use pipeline parameters to control delete detection logic
-
Maintain schema consistency across all runs
Option B: Use DLT's Native Capabilities
-
Leverage
apply_changeswithapply_as_deletesparameter -
Enable Change Data Feed on source tables
-
Use SCD Type 1 for complete record removal
Option C: Dual Pipeline Approach
-
Separate pipeline for weekly full refresh with delete detection
-
Daily incremental pipeline for standard upserts
-
Downstream merge process to combine results
Avoiding Full Refresh Requirements
To prevent the need for full refreshes due to schema changes:
-
Define complete schema upfront including optional columns
-
Use schema evolution settings appropriate for your use case
-
Test schema changes in development before production deployment
-
Consider using streaming tables instead of materialized views for more flexibility
The most sustainable approach would be Option A (consistent schema) combined with DLT's native delete handling capabilities, as it maintains schema consistency while providing the flexibility you need for both refresh patterns.