2 weeks ago
Hello,
I have a question related to APPLY AS DELETE WHEN...
If the source table does not have a column that specifies whether a record was deleted, I am currently using a workaround by ingesting synthetic data with a soft_deletion flag.
Will there be updates in the future to handle deletes automatically? For me, a delete means that an id no longer exists in the bronze table but still exists in the target (silver or gold).
I’m curious to know how other teams are handling this scenario. Do you implement similar workarounds, or do you have another best practice for managing deletes in Delta Live Tables?
2 weeks ago
Hello,
It's fairly common not to have the SCD column/tag in the bronze layer and to add it in silver, since the granularity in bronze (often) doesn't produce collisions on unique IDs, and it's rarely something you care about in bronze. Then in silver you either add a valid_from and valid_to, where valid_to = NULL tells you it's the active version, OR you add a valid flag and an initial date. Both approaches give you history and an easy overview of current values.
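For illustration, a minimal sketch of the first variant (table and column names are illustrative):

-- Silver SCD2 table: valid_to IS NULL marks the active version
CREATE TABLE silver.customers_scd2 (
  id         BIGINT,
  name       STRING,
  valid_from TIMESTAMP,
  valid_to   TIMESTAMP
) USING DELTA;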
Automatically propagating deletions (and more) can be quite well handled with Change Data Feed, although it does require some logic and setup.
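For example, a minimal sketch (table name is illustrative):

-- Enable the Change Data Feed once on the table
ALTER TABLE silver.customers SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read row-level changes since a given table version;
-- _change_type is 'insert', 'update_preimage', 'update_postimage' or 'delete'
SELECT * FROM table_changes('silver.customers', 5);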
When you say bronze does not have information on whether a record was deleted: when you get a new bronze file, how do you determine that a row was deleted? Do you get full loads and can see which rows are missing, or do you deduce it some other way?
2 weeks ago
I get a full load, so let me simplify. I have a per_id and a name. I create another table, bronze_table_extended, with one additional column, is_deleted. For records already ingested into bronze, I mark is_deleted = 'N'.
Then I compare the IDs from silver (with valid_to = 9999-12-31, meaning current records) against the IDs from bronze. If a per_id exists in silver but not in bronze, I insert that per_id into bronze with the rest of the columns set to NULL, and the flag is_deleted = 'Y'.
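Roughly, in SQL the deletion detection looks like this (a simplified sketch of my workaround; table names are illustrative):

-- per_id exists in silver (current version) but is missing from bronze => insert a soft-delete row
INSERT INTO bronze_table_extended (per_id, name, is_deleted)
SELECT s.per_id, NULL AS name, 'Y' AS is_deleted
FROM silver_table s
LEFT ANTI JOIN bronze_table_extended b ON s.per_id = b.per_id
WHERE s.valid_to = '9999-12-31';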
2 weeks ago
Hmm, CDC is not built for full loads, so I don't think that is the best option here.
I can clearly see the logic behind your setup, but I also see room for optimization and cleanliness, since you're doing a lot of comparisons just to figure out what was deleted. I'm not sure how much you can do differently when getting full loads, though.
Does the bronze full load always represent the active data? As I read your answer: if a data point exists in silver and not in bronze, it is deleted; if it exists in both, it is kept; if it is in bronze but not in silver, it is inserted. So as I see it, bronze always contains the most up-to-date data, correct? Have you considered using overwrite, perhaps with a replaceWhere clause, so you only replace the non-history data?
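Something like this, for example (a sketch; the is_current flag and table names are illustrative):

-- Overwrite only the current slice, leaving history rows untouched
INSERT INTO bronze_customers
REPLACE WHERE is_current = true
SELECT *, true AS is_current
FROM latest_full_load;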
2 weeks ago - last edited 2 weeks ago
To be honest, @dalcuovidiu's scenario is not that uncommon. The main problem here is that in a Declarative Pipeline, APPLY AS DELETE WHEN expects an attribute that indicates whether a particular record should be deleted. It would be great to have an option like the one in the MERGE statement: WHEN NOT MATCHED BY SOURCE THEN DELETE.
2 weeks ago
Yep, I was expecting something similar to MERGE to avoid this kind of workaround. Maybe it will be released in the future.
2 weeks ago
Exactly, I hope such a feature will be added in the future.
2 weeks ago
I’d need more details about your data source and the pipeline you’ve planned, but here’s a scenario based on a few best practices to help frame the discussion. When the source doesn’t send a delete flag, you still need a reliable way to reflect deletions in your Delta tables. Below are practical patterns that work well on Databricks.
WHAT TO PRIORITIZE
Real CDC (best case)
Propagate op = 'D' from the source and apply it downstream.
-- DLT example (the target must be declared as a streaming table)
CREATE OR REFRESH STREAMING TABLE silver_customers;

APPLY CHANGES INTO LIVE.silver_customers
FROM STREAM(LIVE.bronze_cdc)
KEYS (id)
APPLY AS DELETE WHEN op IN ('D')
SEQUENCE BY op_ts;
Full snapshot (no CDC)
Use MERGE with NOT MATCHED BY SOURCE. Prefer soft delete.
MERGE INTO silver.customers t
USING staging.curr_snapshot s
ON t.id = s.id
-- Note: UPDATE SET * / INSERT * assume matching schemas; with extra flag
-- columns on the target, list the assignments explicitly (and reset
-- is_deleted here if keys can reappear).
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET t.is_deleted = true, t.deleted_at = current_timestamp();
Incremental without a delete flag (classic case)
Generate synthetic tombstones via an anti-join between the current set of keys and the target’s active keys.
-- Current keys from the latest batch
CREATE OR REPLACE TEMP VIEW curr_keys AS
SELECT DISTINCT id FROM bronze_latest_customers;
-- Active keys in the target
CREATE OR REPLACE TEMP VIEW tgt_keys AS
SELECT id FROM silver.customers WHERE is_deleted = false;
-- Keys that disappeared => tombstones
CREATE OR REPLACE TEMP VIEW tombstones AS
SELECT t.id, true AS is_deleted, current_timestamp() AS deleted_at
FROM tgt_keys t LEFT ANTI JOIN curr_keys c ON t.id = c.id;
-- Apply real changes + tombstones
MERGE INTO silver.customers tgt
USING (
-- Align both branches on the same columns for UNION ALL
SELECT *, false AS is_deleted, CAST(NULL AS TIMESTAMP) AS deleted_at
FROM staging.upserts -- real inserts/updates
UNION ALL
SELECT id, /* NULL for each remaining business column, */ is_deleted, deleted_at
FROM tombstones
) chg ON tgt.id = chg.id
WHEN MATCHED AND chg.is_deleted THEN
UPDATE SET tgt.is_deleted = true, tgt.deleted_at = chg.deleted_at
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND COALESCE(chg.is_deleted, false) = false THEN INSERT *;
Note: In DLT, you can materialize tombstones and use APPLY AS DELETE WHEN is_deleted to keep everything within DLT.
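A minimal DLT sketch of that idea (view and column names are illustrative; the source view unions the real upserts with the tombstones):

CREATE OR REFRESH STREAMING TABLE silver_customers;

APPLY CHANGES INTO LIVE.silver_customers
FROM STREAM(LIVE.changes_with_tombstones) -- upserts UNION ALL tombstones
KEYS (id)
APPLY AS DELETE WHEN is_deleted = true
SEQUENCE BY change_ts;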
DELTA BEST PRACTICES
• Default to soft deletes in Silver/Gold (is_deleted, deleted_at); do physical deletes only when required.
• Add a grace period/watermark to avoid false deletes from late-arriving data (see the sketch after this list).
• Data quality: expectations for unique id, caps on delete percentage per batch; log metrics.
• Change Data Feed (CDF) for downstream propagation/auditing:
ALTER TABLE silver.customers SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
• Performance: OPTIMIZE ... ZORDER BY (id) and Deletion Vectors (if enabled):
ALTER TABLE silver.customers SET TBLPROPERTIES (delta.enableDeletionVectors = true);
• Observability & governance: record batch counts; use DLT event_log / system tables.
• Idempotency: every batch should be safely re-runnable.
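On the grace-period bullet, a minimal sketch (last_seen_ts is an illustrative column you would maintain on the target): only tombstone keys that have been absent longer than the window, so a late or partial extract doesn't mass-delete.

-- Guarded tombstones: key absent from the current load AND not seen recently
CREATE OR REPLACE TEMP VIEW tombstones_guarded AS
SELECT t.id, true AS is_deleted, current_timestamp() AS deleted_at
FROM silver.customers t
LEFT ANTI JOIN curr_keys c ON t.id = c.id
WHERE t.is_deleted = false
  AND t.last_seen_ts < current_timestamp() - INTERVAL 2 DAYS;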
TL;DR
• If you have CDC, use APPLY AS DELETE.
• If the source delivers a full snapshot, use MERGE with NOT MATCHED BY SOURCE (soft delete).
• If it’s incremental without native deletes, create tombstones via anti-join and apply via MERGE/DLT.
I hope this was helpful.
2 weeks ago
Hi @dalcuovidiu,
See if this makes sense to you, given a full-snapshot source without a delete flag:
• Keep Bronze immutable.
• Materialize a deduplicated current snapshot in staging (e.g., via ROW_NUMBER() over ingest_ts per id).
• Apply an idempotent Delta MERGE in Silver with NOT MATCHED BY SOURCE to mark soft deletes (is_deleted / deleted_at).
• Guard it with a watermark/grace window for late data.
• Enable CDF for downstream propagation.
• Add performance guardrails (partitioning / Z-ORDER on id, Deletion Vectors) and alert thresholds to prevent flapping.
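For the dedup step, a minimal sketch (table and column names are illustrative):

-- Keep one current row per id from the full load
CREATE OR REPLACE TEMP VIEW staging_curr_snapshot AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY ingest_ts DESC) AS rn
  FROM bronze.customers_full_load
)
WHERE rn = 1;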
2 weeks ago
Ok, I will give it a try. I was reading the documentation, and it's not very clear about the most common use case: when the source system doesn't have a dedicated column indicating that a record was deleted. In my case, I just created such a column myself to continue with AUTO CDC.
2 weeks ago
Ok. My case matches "Incremental without a delete flag (classic case)": generate synthetic tombstones via an anti-join between the current set of keys and the target's active keys.
I don't want to use MERGE, which is why my question was about CDC in the context of Declarative Pipelines.
Thanks,
Ovidiu
a week ago
This is the way.
If your source system cannot deliver delete images, you have to compare against a full (PK-only) extract.