Databricks Community

Surya2 · yesterday

Summary

I'm exploring GDPR delete propagation through a medallion architecture (Bronze → Silver → Gold) using Auto CDC with Change Data Feed. Delete events propagate successfully from Landing → Bronze, but fail to propagate from Bronze → Silver → Gold, even though deletes exist in Bronze's CDF.

Architecture

Landing (regular Delta + CDF enabled)

→ Bronze (Auto CDC + CDF enabled)

→ Silver (Auto CDC + CDF enabled)

→ Gold (Auto CDC target)

What Works

✅ Landing → Bronze: Delete events propagate correctly
✅ Bronze CDF contains delete events (verified via table_changes())
✅ Update events propagate through all layers

What Doesn't Work

❌ Bronze → Silver: Delete events are NOT captured when Silver reads Bronze's CDF
❌ Silver → Gold: Delete events are NOT captured when Gold reads Silver's CDF

Question

Is chaining Auto CDC flows via streaming CDF reads a supported pattern for propagating deletes?

Specifically: When an Auto CDC target table has CDF enabled, and a downstream layer reads that table's CDF using .option("readChangeFeed", "true"), do streaming reads emit delete events?

Louis_Frolio · yesterday

Hi @Surya2 ,

Nice write-up. The symptom you're describing, where updates propagate cleanly but deletes quietly disappear, is a common one, and the good news is that the pattern you're after is fully supported. The break is almost certainly in how your downstream layer consumes the upstream change feed, not in how AUTO CDC handles deletes. Let me walk through it.

The short version is yes. Chaining AUTO CDC flows by having a downstream layer read the upstream target's change data feed via .option("readChangeFeed", "true") is a supported pattern, and streaming CDF reads do emit delete events. The docs are explicit that you can read a change data feed from a streaming table that is the target of AUTO CDC (or AUTO CDC FROM SNAPSHOT) the same way you'd read the feed from any other Delta table.

Two things are worth confirming before anything else. Your upstream AUTO CDC target needs to be a Unity Catalog streaming table, and you need to be on Databricks Runtime 15.2 or above. If either of those isn't true, that's the first thing to fix.

Now to the heart of it. Because updates propagate and only deletes fail, the cause is one of two things, and a single check tells them apart. Inspect your downstream (Silver) source view and confirm whether it actually carries the CDF metadata columns: _change_type, _commit_version, and _commit_timestamp.

If those columns are present, you're in the first case. Silver is reading the change feed correctly, but the AUTO CDC flow isn't translating _change_type = 'delete' rows into actual deletes. A delete event arrives carrying the row's values, and without an apply_as_deletes condition, AUTO CDC treats it as an ordinary upsert and effectively re-inserts the row. Update events (update_postimage) upsert correctly, which is exactly why your updates work and your deletes don't. The fix is to add the apply_as_deletes condition, shown below.

If those columns are missing, you're in the second case. The readChangeFeed option isn't being honored, and the read has fallen back to a normal, append-oriented streaming read, so delete events never reach Silver at all. Make sure the option is set, make sure you aren't using skipChangeCommits (it explicitly ignores source updates and deletes), and verify the columns then appear. There's also an internal pipeline setting some teams have reached for when reader options get dropped (pipelines.streamingFlowReadOptionsEnabled=true), but that one isn't in the public docs, so treat it as a last resort rather than a documented fix.

Here's the pattern that resolves both cases:

from pyspark import pipelines as dp
from pyspark.sql.functions import col, expr

@dp.temporary_view()
def bronze_cdf():
    return (
        spark.readStream
             .option("readChangeFeed", "true")             # second case: must be honored
             .table("catalog.schema.bronze_table")
             .where("_change_type != 'update_preimage'")   # drop preimages so they don't resurrect rows
    )

dp.create_streaming_table("silver_table")

dp.create_auto_cdc_flow(
    target             = "silver_table",
    source             = "bronze_cdf",
    keys               = ["id"],
    sequence_by        = col("_commit_version"),            # order out-of-order changes (or _commit_timestamp)
    apply_as_deletes   = expr("_change_type = 'delete'"),   # first case: the usual missing piece
    except_column_list = ["_change_type", "_commit_version", "_commit_timestamp"],
    stored_as_scd_type = 1,                                 # Type 1 = row physically removed (needed for GDPR)
)

The equivalent key clauses in SQL are APPLY AS DELETE WHEN _change_type = 'delete', SEQUENCE BY _commit_version, and STORED AS SCD TYPE 1. Apply the same shape again for Silver to Gold.

A few things to verify as you go. Confirm DBR 15.2+ and Unity Catalog for every intermediate AUTO CDC target. You already have CDF enabled on each layer, just keep in mind that CDF is retained only for a window, so very old deletes can age out. Use SCD Type 1, not Type 2, since Type 2 only marks the row expired (a tombstone) while GDPR removal needs Type 1's physical delete. Verify the CDF metadata columns actually show up in each downstream source view, since that's the fastest way to know which case you're in. And never use skipChangeCommits on these reads, because it silently drops the very deletes you're trying to propagate.

So the pattern works. Since your deletes are visible in Bronze's CDF but aren't reaching Silver, the issue is in how Silver consumes Bronze. Either it isn't reading the CDF as CDF (second case), or it's reading it but not mapping _change_type = 'delete' into apply_as_deletes (first case). Once that's corrected, deletes will propagate through Silver and Gold the same way your updates already do.

Regards, Louis.

References

Advanced AUTO CDC topics, reading CDF from AUTO CDC targets. https://docs.databricks.com/aws/en/ldp/cdc-advanced
The AUTO CDC APIs: Simplify change data capture with pipelines. https://docs.databricks.com/aws/en/ldp/cdc
create_auto_cdc_flow (Python reference). https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-apply-changes
Use Delta Lake change data feed on Databricks. https://docs.databricks.com/aws/en/delta/delta-change-data-feed
Propagating Deletes: Managing Data Removal using Delta Live Tables (Community technical blog). https://community.databricks.com/t5/technical-blog/propagating-deletes-managing-data-removal-using-d...

View solution in original post

Louis_Frolio · yesterday

Hi @Surya2 ,

Nice write-up. The symptom you're describing, where updates propagate cleanly but deletes quietly disappear, is a common one, and the good news is that the pattern you're after is fully supported. The break is almost certainly in how your downstream layer consumes the upstream change feed, not in how AUTO CDC handles deletes. Let me walk through it.

The short version is yes. Chaining AUTO CDC flows by having a downstream layer read the upstream target's change data feed via .option("readChangeFeed", "true") is a supported pattern, and streaming CDF reads do emit delete events. The docs are explicit that you can read a change data feed from a streaming table that is the target of AUTO CDC (or AUTO CDC FROM SNAPSHOT) the same way you'd read the feed from any other Delta table.

Two things are worth confirming before anything else. Your upstream AUTO CDC target needs to be a Unity Catalog streaming table, and you need to be on Databricks Runtime 15.2 or above. If either of those isn't true, that's the first thing to fix.

Now to the heart of it. Because updates propagate and only deletes fail, the cause is one of two things, and a single check tells them apart. Inspect your downstream (Silver) source view and confirm whether it actually carries the CDF metadata columns: _change_type, _commit_version, and _commit_timestamp.

If those columns are present, you're in the first case. Silver is reading the change feed correctly, but the AUTO CDC flow isn't translating _change_type = 'delete' rows into actual deletes. A delete event arrives carrying the row's values, and without an apply_as_deletes condition, AUTO CDC treats it as an ordinary upsert and effectively re-inserts the row. Update events (update_postimage) upsert correctly, which is exactly why your updates work and your deletes don't. The fix is to add the apply_as_deletes condition, shown below.

If those columns are missing, you're in the second case. The readChangeFeed option isn't being honored, and the read has fallen back to a normal, append-oriented streaming read, so delete events never reach Silver at all. Make sure the option is set, make sure you aren't using skipChangeCommits (it explicitly ignores source updates and deletes), and verify the columns then appear. There's also an internal pipeline setting some teams have reached for when reader options get dropped (pipelines.streamingFlowReadOptionsEnabled=true), but that one isn't in the public docs, so treat it as a last resort rather than a documented fix.

Here's the pattern that resolves both cases:

from pyspark import pipelines as dp
from pyspark.sql.functions import col, expr

@dp.temporary_view()
def bronze_cdf():
    return (
        spark.readStream
             .option("readChangeFeed", "true")             # second case: must be honored
             .table("catalog.schema.bronze_table")
             .where("_change_type != 'update_preimage'")   # drop preimages so they don't resurrect rows
    )

dp.create_streaming_table("silver_table")

dp.create_auto_cdc_flow(
    target             = "silver_table",
    source             = "bronze_cdf",
    keys               = ["id"],
    sequence_by        = col("_commit_version"),            # order out-of-order changes (or _commit_timestamp)
    apply_as_deletes   = expr("_change_type = 'delete'"),   # first case: the usual missing piece
    except_column_list = ["_change_type", "_commit_version", "_commit_timestamp"],
    stored_as_scd_type = 1,                                 # Type 1 = row physically removed (needed for GDPR)
)

The equivalent key clauses in SQL are APPLY AS DELETE WHEN _change_type = 'delete', SEQUENCE BY _commit_version, and STORED AS SCD TYPE 1. Apply the same shape again for Silver to Gold.

A few things to verify as you go. Confirm DBR 15.2+ and Unity Catalog for every intermediate AUTO CDC target. You already have CDF enabled on each layer, just keep in mind that CDF is retained only for a window, so very old deletes can age out. Use SCD Type 1, not Type 2, since Type 2 only marks the row expired (a tombstone) while GDPR removal needs Type 1's physical delete. Verify the CDF metadata columns actually show up in each downstream source view, since that's the fastest way to know which case you're in. And never use skipChangeCommits on these reads, because it silently drops the very deletes you're trying to propagate.

So the pattern works. Since your deletes are visible in Bronze's CDF but aren't reaching Silver, the issue is in how Silver consumes Bronze. Either it isn't reading the CDF as CDF (second case), or it's reading it but not mapping _change_type = 'delete' into apply_as_deletes (first case). Once that's corrected, deletes will propagate through Silver and Gold the same way your updates already do.

Regards, Louis.

References

Advanced AUTO CDC topics, reading CDF from AUTO CDC targets. https://docs.databricks.com/aws/en/ldp/cdc-advanced
The AUTO CDC APIs: Simplify change data capture with pipelines. https://docs.databricks.com/aws/en/ldp/cdc
create_auto_cdc_flow (Python reference). https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-apply-changes
Use Delta Lake change data feed on Databricks. https://docs.databricks.com/aws/en/delta/delta-change-data-feed
Propagating Deletes: Managing Data Removal using Delta Live Tables (Community technical blog). https://community.databricks.com/t5/technical-blog/propagating-deletes-managing-data-removal-using-d...

Surya2 · 14 hours ago

Hi Louis @Louis_Frolio

Thank you very much for your comprehensive troubleshooting guidance. The references you shared, particularly the technical blog post on "Propagating Deletes..." were extremely helpful and contained information I had missed earlier.

Following the design pattern recommended in that blog post, I successfully implemented a POC: Bronze streaming table with CDF enabled → Silver streaming table that consumes CDF change records → Gold materialized view.

I'm happy to adopt this pattern going forward.

I've set aside my original design pattern, which was based on my initial intuition to leverage CDF at each layer of the medallion architecture (Bronze → Silver → Gold, all streaming tables with CDF enabled and Auto CDC applied at each hop).

However, for learning purposes, I revisited my original multi-hop CDC pattern to understand why it didn't work. As you suggested, I verified the following:

✅All tables (landing, bronze, silver, gold) are Unity Catalog tables
✅Databricks Runtime 18.2
✅CDF metadata columns (_change_type, _commit_version, _commit_timestamp) appear in each downstream source view
✅Both silver and gold Auto CDC flows include apply_as_deletes=expr("_change_type = 'delete'")
✅Not using skipChangeCommits

Despite confirming all the above, deletes still did not propagate beyond Bronze (to Silver/Gold), while updates propagated successfully through all layers.

It appears that CDF-based delete propagation through Auto CDC is supported for single-hop scenarios only (e.g., Landing → Bronze OR Bronze → Silver), but not for multi-hop chaining (Landing → Bronze → Silver → Gold).

I may be missing something, but this appears to be a limitation when chaining multiple Auto CDC flows that read from upstream CDF-enabled streaming tables. Even the design pattern that was recommended in the Databricks technical blog as referred above uses only one CDF consumption hop (Bronze with CDF → Silver consumes CDF). In my testing, deletes work perfectly for the first hop (Landing → Bronze) but fail at subsequent hops (Bronze → Silver, Silver → Gold). Update events propagate through all layers, but delete events stop after the first Auto CDC application

Regardless, I've learned valuable patterns from this exploration and will proceed with the recommended single-hop CDF design.

Your help in this regard is much appreciated. Thank you again!