02-09-2026 06:55 AM
We have a set of streaming DLT pipelines following a medallion pattern: S3 bucket -> Auto Loader -> bronze Delta tables -> silver Delta tables -> gold Delta tables. All Delta tables are in a Unity Catalog under separate schemas. We need a way to delete a set of data records from gold, silver, and bronze and reload them, by renaming or copying the original files back into the S3 bucket. When we attempted this, Auto Loader does pick up the renamed files and reloads them into the bronze Delta table, but the data does not flow further into the silver and gold Delta tables. Is this expected behavior? Why is the silver Delta table's stream reader unable to pick up the newly reloaded data in the bronze Delta table? There is no error, but it seems to completely ignore the reloaded data while still picking up other newly arrived data. We want the ability to delete and reload data using the same pipelines, without writing completely separate pipelines for this work. Is there any solution for this?
Thanks in advance
Ravi Pal
02-09-2026 08:46 AM
Is a full table refresh not an option?
Since these are streaming tables, my assumption is that the offset has already passed the data you reloaded, and the new versions are being resolved as the ones previously processed.
A full refresh will reset the table and offset and reload from scratch. Streaming tables are really meant for append only processing, whereas this use case seems more closely aligned with an auto_cdc pattern.
02-09-2026 09:36 AM
Full refresh is not an option for us. Our tables are append-only for regular runs, roughly 90% of the time. But we do have occasions where we need to backload/remediate previously loaded data. So we want to delete the data from the Delta tables and reload it from the source again using the same pipeline. Just wondering what others are doing for similar requirements.
02-09-2026 06:28 PM
I believe this happens because of the checkpoint that Delta streams store. When you process data directly from S3 via Auto Loader, it is reading the files from the bucket. In the silver/gold layers you are using the Delta streaming engine; since the checkpoint records where data processing stopped, the stream discards what has already been processed and only processes new data. If you only need to do this once, you could reset the checkpoint. Otherwise, model the pipeline with a CDC structure, since the full-reprocess suggestion from the colleague above does not fit your scenario.
03-07-2026 08:11 PM
Hi @ravipal-global,
I have seen this pattern before. The behavior you are seeing is expected. Let me explain why it happens and then walk through several approaches that can help you achieve delete-and-reload without requiring a full refresh of your entire pipeline.
WHY SILVER AND GOLD DO NOT PICK UP THE RELOADED DATA
Your pipeline follows the pattern: S3 -> Auto Loader -> Bronze -> Silver (stream) -> Gold (stream)
When you delete records from bronze and Auto Loader re-ingests the renamed files, two things happen at the bronze Delta table level:
1. The DELETE operation creates a "non-append" commit in the Delta log (a commit that removes data).
2. The re-ingested data creates a new append commit.
The silver streaming table reads bronze as a Delta streaming source. By default in Lakeflow Declarative Pipelines (formerly Delta Live Tables / DLT), downstream streaming reads use skipChangeCommits behavior, which means the stream ignores any Delta commit that involves deletes or updates. This is designed to keep append-only streams safe from breaking on upstream modifications.
The tricky part is that even though the re-appended data is technically a new append commit, the stream checkpoint may have already advanced past the version where the delete happened, and the interaction between the delete commit and the subsequent append commit can cause the stream to treat the reloaded records as part of a change operation rather than a clean new append. This is why the silver table seems to "ignore" the reloaded data while still picking up genuinely new data that arrives later.
Documentation on skipChangeCommits behavior: https://docs.databricks.com/en/structured-streaming/delta-lake.html
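One quick way to confirm this on your side is to inspect the bronze table's Delta history after a delete-and-reload. A sketch (catalog, schema, and table names are placeholders):

```sql
-- Inspect recent commits on bronze (names are placeholders).
-- After a delete-and-reload you should see a DELETE operation
-- (a non-append commit) followed by the streaming append that
-- re-ingested the renamed files:
DESCRIBE HISTORY my_catalog.my_schema.bronze_table LIMIT 10;
```

The operation column tells you which commits the downstream stream will treat as change commits rather than plain appends.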
OPTION 1: SELECTIVE FULL REFRESH OF SPECIFIC TABLES (RECOMMENDED)
You do not need to full-refresh your entire pipeline. Lakeflow Declarative Pipelines supports refreshing individual tables selectively.
From the Pipeline UI:
1. Open your pipeline
2. Click the dropdown arrow next to the "Start" button
3. Select "Full refresh selection"
4. Choose only the specific bronze, silver, and gold tables that need the reload
Via the REST API:
POST /api/2.0/pipelines/{pipeline_id}/updates
{
  "full_refresh_selection": [
    "my_schema.bronze_table",
    "my_schema.silver_table",
    "my_schema.gold_table"
  ]
}
This clears data and checkpoints for only the selected tables and reprocesses them from their sources, leaving all other tables in your pipeline untouched. Since you mentioned this is only needed occasionally for a subset of data, selectively refreshing the affected chain of tables (bronze through gold for that data path) is the most straightforward approach.
Documentation: https://docs.databricks.com/en/ldp/updates.html
OPTION 2: USE AUTO CDC (APPLY CHANGES) FOR SILVER AND GOLD
If your occasional delete-and-reload pattern is something you want to handle without any manual refresh steps, consider changing your silver and gold table definitions to use AUTO CDC (formerly APPLY CHANGES INTO) instead of plain append streams.
With AUTO CDC, you define a sequencing column and a primary key. The pipeline will then correctly handle inserts, updates, and deletes from the upstream table as change events rather than as append-only data. When you delete and reload records in bronze, silver will process the deletes and the new inserts as CDC events automatically.
SQL example for silver:
CREATE STREAMING TABLE silver_table;
AUTO CDC INTO LIVE.silver_table
FROM STREAM(LIVE.bronze_table)
KEYS (record_id)
SEQUENCE BY event_timestamp;
Python example:
import dlt
from pyspark.sql.functions import col

# Create the target streaming table, then define the CDC flow into it
dlt.create_streaming_table("silver_table")

dlt.apply_changes(
    target="silver_table",
    source="bronze_table",
    keys=["record_id"],
    sequence_by=col("event_timestamp")
)
This approach requires that you enable Change Data Feed (CDF) on your bronze table so that downstream streams can capture the deletes. You can enable CDF by adding this table property:
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
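On a table that already exists, the property can be applied with ALTER TABLE (the table name below is a placeholder); for pipeline-managed tables it is usually cleaner to declare it in the table definition itself:

```sql
-- Enable the Change Data Feed on an existing bronze table
ALTER TABLE my_catalog.my_schema.bronze_table
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
```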
Documentation on AUTO CDC: https://docs.databricks.com/en/ldp/cdc.html
Documentation on Change Data Feed: https://docs.databricks.com/en/delta/delta-change-data-feed.html
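With CDF enabled, you can also verify that the deletes and re-inserts are actually being captured by querying the change feed directly. A sketch, assuming record_id is the key column from the earlier examples and using an arbitrary starting version:

```sql
-- Inspect change events since version 5 (placeholder version).
-- _change_type shows 'delete' for removed rows and 'insert'
-- for the reloaded ones:
SELECT record_id, _change_type, _commit_version
FROM table_changes('my_catalog.my_schema.bronze_table', 5)
ORDER BY _commit_version;
```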
OPTION 3: SEPARATE BACKFILL FLOW USING APPEND FLOW
If you want to keep your main pipeline as pure append-only streams but still handle occasional reloads, you can add a separate append flow that targets the same streaming table. This lets you backfill or reload specific data without touching the primary streaming flow.
SQL example:
-- Your main flow (always running)
CREATE OR REFRESH STREAMING TABLE bronze_table
AS SELECT * FROM STREAM read_files('s3://bucket/path');
-- A one-time backfill flow for reload scenarios (runs once, then is skipped)
CREATE FLOW reload_batch_2026_03
AS INSERT INTO ONCE LIVE.silver_table BY NAME
SELECT * FROM LIVE.bronze_table
WHERE reload_batch_id = 'batch_2026_03';
The one-time append flow runs once and appends the specified data to the target without interfering with the main streaming checkpoint.
Documentation on append flows: https://docs.databricks.com/en/ldp/flows.html
OPTION 4: RESET STREAMING CHECKPOINTS FOR SPECIFIC FLOWS
If you want to keep existing data and force the downstream streams to reprocess from the beginning, you can reset checkpoints for specific flows using the REST API:
POST /api/2.0/pipelines/{pipeline_id}/updates
{
  "reset_checkpoint_selection": ["silver_flow_name", "gold_flow_name"]
}
This clears the streaming checkpoint for the specified flows without deleting data from the tables. On the next pipeline run, those flows will reprocess all available data from their sources. Be aware this may create duplicate records in the target tables unless you add deduplication logic.
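After a checkpoint reset, a quick duplicate check on the target helps confirm whether you need that deduplication logic. A sketch, again assuming record_id is the business key:

```sql
-- Count rows per key; anything with copies > 1 was double-processed
SELECT record_id, COUNT(*) AS copies
FROM my_catalog.my_schema.silver_table
GROUP BY record_id
HAVING COUNT(*) > 1;
```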
PROTECTING TABLES FROM ACCIDENTAL FULL REFRESH
For tables that should never be fully refreshed (for example, tables with very large history), you can set:
TBLPROPERTIES ('pipelines.reset.allowed' = 'false')
This prevents accidental full refreshes on those specific tables.
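For pipeline-managed tables, the property goes in the table definition itself. A sketch reusing the bronze definition from Option 3 (bucket path and table name are placeholders):

```sql
CREATE OR REFRESH STREAMING TABLE bronze_table
TBLPROPERTIES ('pipelines.reset.allowed' = 'false')
AS SELECT * FROM STREAM read_files('s3://bucket/path');
```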
SUMMARY
For your use case where you need occasional delete-and-reload while keeping the pipeline append-only most of the time, I would recommend:
- Short term: Use selective full refresh (Option 1) on just the affected tables in the chain. This is the simplest path and does not require pipeline changes.
- Long term: If delete-and-reload becomes more frequent, consider restructuring silver and gold to use AUTO CDC (Option 2), which natively handles deletes and reloads as CDC events.
Hope this helps! Let me know if you have questions about implementing any of these approaches.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.