SteveOstrowski
Databricks Employee
Databricks Employee

Hi @ravipal-global,

I have seen this pattern before. The behavior you are seeing is expected. Let me explain why it happens and then walk through several approaches that can help you achieve delete-and-reload without requiring a full refresh of your entire pipeline.


WHY SILVER AND GOLD DO NOT PICK UP THE RELOADED DATA

Your pipeline follows the pattern: S3 -> Auto Loader -> Bronze -> Silver (stream) -> Gold (stream)

When you delete records from bronze and Auto Loader re-ingests the renamed files, two things happen at the bronze Delta table level:

1. The DELETE operation creates a "non-append" commit in the Delta log (a commit that removes data).
2. The re-ingested data creates a new append commit.

The silver streaming table reads bronze as a Delta streaming source. By default in Lakeflow Spark Declarative Pipelines (SDP) (formerly Delta Live Tables / SDP), downstream streaming reads use skipChangeCommits behavior, which means the stream ignores any Delta commit that involves deletes or updates. This is designed to keep append-only streams safe from breaking on upstream modifications.

The tricky part is that even though the re-appended data is technically a new append commit, the stream checkpoint may have already advanced past the version where the delete happened, and the interaction between the delete commit and the subsequent append commit can cause the stream to treat the reloaded records as part of a change operation rather than a clean new append. This is why the silver table seems to "ignore" the reloaded data while still picking up genuinely new data that arrives later.

Documentation on skipChangeCommits behavior: https://docs.databricks.com/en/structured-streaming/delta-lake.html


OPTION 1: SELECTIVE FULL REFRESH OF SPECIFIC TABLES (RECOMMENDED)

You do not need to full-refresh your entire pipeline. Lakeflow Spark Declarative Pipelines (SDP) supports refreshing individual tables selectively.

From the Pipeline UI:
1. Open your pipeline
2. Click the dropdown arrow next to the "Start" button
3. Select "Full refresh selection"
4. Choose only the specific bronze, silver, and gold tables that need the reload

Via the REST API:

POST /api/2.0/pipelines/{pipeline_id}/updates
{
"full_refresh_selection": [
"my_schema.bronze_table",
"my_schema.silver_table",
"my_schema.gold_table"
]
}

This clears data and checkpoints for only the selected tables and reprocesses them from their sources, leaving all other tables in your pipeline untouched. Since you mentioned this is only needed occasionally for a subset of data, selectively refreshing the affected chain of tables (bronze through gold for that data path) is the most straightforward approach.

Documentation: https://docs.databricks.com/en/ldp/updates.html


OPTION 2: USE AUTO CDC (APPLY CHANGES) FOR SILVER AND GOLD

If your occasional delete-and-reload pattern is something you want to handle without any manual refresh steps, consider changing your silver and gold table definitions to use AUTO CDC (formerly APPLY CHANGES INTO) instead of plain append streams.

With AUTO CDC, you define a sequencing column and a primary key. The pipeline will then correctly handle inserts, updates, and deletes from the upstream table as change events rather than as append-only data. When you delete and reload records in bronze, silver will process the deletes and the new inserts as CDC events automatically.

SQL example for silver:

CREATE STREAMING TABLE silver_table;

AUTO CDC INTO LIVE.silver_table
FROM STREAM(LIVE.bronze_table)
KEYS (record_id)
SEQUENCE BY event_timestamp;

Python example:

import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("silver_table")

@Dlt.append_flow(target="silver_table")
@Dlt.expect_or_drop("valid_id", "record_id IS NOT NULL")

# Or using apply_changes:
dlt.apply_changes(
target="silver_table",
source="bronze_table",
keys=["record_id"],
sequence_by=col("event_timestamp")
)

This approach requires that you enable Change Data Feed (CDF) on your bronze table so that downstream streams can capture the deletes. You can enable CDF by adding this table property:

TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')

Documentation on AUTO CDC: https://docs.databricks.com/en/ldp/cdc.html
Documentation on Change Data Feed: https://docs.databricks.com/en/delta/delta-change-data-feed.html


OPTION 3: SEPARATE BACKFILL FLOW USING APPEND FLOW

If you want to keep your main pipeline as pure append-only streams but still handle occasional reloads, you can add a separate append flow that targets the same streaming table. This lets you backfill or reload specific data without touching the primary streaming flow.

SQL example:

-- Your main flow (always running)
CREATE OR REFRESH STREAMING TABLE bronze_table
AS SELECT * FROM STREAM read_files('s3://bucket/path');

-- A one-time backfill flow for reload scenarios
INSERT INTO LIVE.silver_table BY NAME
SELECT * FROM LIVE.bronze_table
WHERE reload_batch_id = 'batch_2026_03';

The one-time append flow runs once and appends the specified data to the target without interfering with the main streaming checkpoint.

Documentation on append flows: https://docs.databricks.com/en/ldp/flows.html


OPTION 4: RESET STREAMING CHECKPOINTS FOR SPECIFIC FLOWS

If you want to keep existing data and force the downstream streams to reprocess from the beginning, you can reset checkpoints for specific flows using the REST API:

POST /api/2.0/pipelines/{pipeline_id}/updates
{
"reset_checkpoint_selection": ["silver_flow_name", "gold_flow_name"]
}

This clears the streaming checkpoint for the specified flows without deleting data from the tables. On the next pipeline run, those flows will reprocess all available data from their sources. Be aware this may create duplicate records in the target tables unless you add deduplication logic.


PROTECTING TABLES FROM ACCIDENTAL FULL REFRESH

For tables that should never be fully refreshed (for example, tables with very large history), you can set:

TBLPROPERTIES ('pipelines.reset.allowed' = 'false')

This prevents accidental full refreshes on those specific tables.


SUMMARY

For your use case where you need occasional delete-and-reload while keeping the pipeline append-only most of the time, I would recommend:

- Short term: Use selective full refresh (Option 1) on just the affected tables in the chain. This is the simplest path and does not require pipeline changes.
- Long term: If delete-and-reload becomes more frequent, consider restructuring silver and gold to use AUTO CDC (Option 2), which natively handles deletes and reloads as CDC events.

Hope this helps! Let me know if you have questions about implementing any of these approaches.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.