
What is the Power of DLT Pipeline to read streaming data

Anubhav2011
New Contributor II

I am getting thousands of records every second in my bronze table from Qlik, and every second the bronze table is truncated and reloaded with new data by Qlik itself. How do I process this much data into my silver streaming table every second, before the bronze table is truncated again, using a DLT pipeline? Is a DLT pipeline powerful enough that, running in continuous mode, it can fetch this many records every second without losing any data? My bronze table has to be a truncate-and-load target, and this cannot be changed.

3 REPLIES

ManojkMohan
Valued Contributor III

Core Problem

  • Bronze table is not append-only, but truncate + insert every second.
  • DLT (Delta Live Tables) in continuous mode assumes append-only streaming sources (like Kafka).
  • Because Qlik wipes and replaces data every second, DLT cannot guarantee no data loss if you read bronze directly in streaming mode.

Why This Breaks Streaming
Streaming queries in Databricks track offsets or files appended to a source. If Qlik truncates the table, then:

  • The data that was there is gone.
  • DLT sees the same table "start over" every second → lost micro-batches.
  • No checkpointing can recover truncated rows.

So in the current setup, you're effectively treating the bronze table like a volatile cache, not a durable streaming source.

Options to Solve This
1. Add a Durable Append Layer Before DLT

Instead of pointing DLT to the truncate-load bronze table, introduce an append-only ingestion layer.

Example:

Qlik → writes to staging (truncate-load every second).

A lightweight job (Auto Loader or Structured Streaming with foreachBatch) → copies new rows into an append-only Delta table (the true bronze).

DLT (continuous) → reads from this append-only table safely.

This decouples Qlik's truncate pattern from your streaming system; a rough sketch of the copy step follows below.
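A minimal sketch of that copy step, simplified to a plain scheduled batch copy rather than Auto Loader or foreachBatch, and assuming hypothetical table names qlik_staging (the truncate-load target) and bronze_append (the durable append-only table), run in a Databricks notebook or job where spark is predefined:

```python
# Sketch: copy the current Qlik snapshot into an append-only bronze table.
# Assumed names: qlik_staging (truncate-load target), bronze_append (durable bronze).
from pyspark.sql import functions as F

# Read whatever snapshot Qlik has loaded right now.
snapshot = spark.read.table("qlik_staging")

# Stamp each row with the capture time so downstream consumers can tell snapshots apart.
snapshot = snapshot.withColumn("_ingested_at", F.current_timestamp())

# Append (never overwrite) into the table that DLT streams from.
snapshot.write.format("delta").mode("append").saveAsTable("bronze_append")
```

The DLT pipeline then streams safely from the append-only table (for example, a streaming table defined over spark.readStream.table("bronze_append")). How often this copy runs relative to Qlik's truncate cadence determines whether any snapshot can be missed.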

2. Snapshot Approach (Batch DLT) - I would recommend this

If you must keep the truncate load, then treat each second's truncate-load as a full snapshot.

DLT can run in triggered batch mode every second (or every few seconds):

Compare the new snapshot with the last snapshot.

Compute delta changes (insert/update/delete).

Write results to silver.

Downside: not true "streaming," but it avoids data loss. A rough sketch of the compare-and-merge step follows below.
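For illustration, here is a rough sketch of that compare-and-merge step done by hand with Delta MERGE, assuming hypothetical tables qlik_staging (the current snapshot), last_snapshot (a saved copy of the previous snapshot), and silver, all keyed on an assumed id column; the built-in snapshot CDC API mentioned in the next reply does essentially this for you:

```python
# Sketch: manual snapshot comparison with Delta MERGE (all names are placeholders).
# qlik_staging = current snapshot, last_snapshot = previous snapshot copy,
# silver = target table; "id" is an assumed primary-key column.
from delta.tables import DeltaTable

new_snap = spark.read.table("qlik_staging")

# Upsert the new snapshot into silver (covers inserts and updates).
(DeltaTable.forName(spark, "silver").alias("s")
    .merge(new_snap.alias("n"), "s.id = n.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Rows present in the previous snapshot but absent from the new one were deleted upstream.
old_snap = spark.read.table("last_snapshot")
deleted_keys = old_snap.select("id").subtract(new_snap.select("id"))
(DeltaTable.forName(spark, "silver").alias("s")
    .merge(deleted_keys.alias("d"), "s.id = d.id")
    .whenMatchedDelete()
    .execute())

# Save the new snapshot as the baseline for the next comparison.
new_snap.write.format("delta").mode("overwrite").saveAsTable("last_snapshot")
```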

Thanks Manoj for your reply. Could you please explain the second (snapshot) method in detail? What exactly do I need to do? I also have one more question: if my DLT streaming table always reads data from an append-only table, how do I keep the data in that source table from growing indefinitely? How do I set a retention policy on my source table?

@Anubhav2011, here is more information on the snapshot method: https://docs.databricks.com/aws/en/dlt/cdc#how-is-cdc-implemented-with-the-auto-cdc-from-snapshot-ap...

This process efficiently determines changes in source data by comparing a series of snapshots taken in order. It then executes the necessary processing for change data capture (CDC) of the records in those snapshots. This functionality is supported exclusively by the Lakeflow Declarative Pipelines Python interface.
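In the DLT / Lakeflow Declarative Pipelines Python interface this looks roughly like the following; the table names (qlik_staging, silver_customers) and the id key column are placeholders, and the linked page may show the newer AUTO CDC naming for the same capability:

```python
# Hedged sketch of snapshot-based CDC in a pipeline notebook.
# Placeholder names: qlik_staging (snapshotted source), silver_customers (target), "id" key.
import dlt

# Target streaming table that receives the CDC output.
dlt.create_streaming_table("silver_customers")

dlt.apply_changes_from_snapshot(
    target="silver_customers",
    source="qlik_staging",     # each pipeline update reads the current snapshot of this table
    keys=["id"],               # key used to classify rows as inserts, updates, or deletes
    stored_as_scd_type=1,      # keep only the latest version of each row
)
```

Each pipeline update (triggered or scheduled) reads the current snapshot of the source, diffs it against the previous one, and applies the resulting changes to the target.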

By the end of the year, there are plans to incorporate TTL (time to live) functionality for Delta tables (both managed and DLT tables). The timeline may change, but in the meantime, you can set up a job with vacuum.

https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum
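Until that TTL feature ships, one pattern is a small scheduled retention job against the append-only source table; the table name bronze_append, the _ingested_at column, and the 7-day window below are all assumptions:

```python
# Sketch: scheduled retention job for the append-only source table.
# Assumptions: bronze_append has an _ingested_at timestamp column; 7-day retention.
# DELETE removes old rows logically; VACUUM later reclaims the underlying files
# once they fall outside the Delta retention threshold.
spark.sql("""
    DELETE FROM bronze_append
    WHERE _ingested_at < current_timestamp() - INTERVAL 7 DAYS
""")

spark.sql("VACUUM bronze_append")
```

Note that deleting rows from a table a streaming query reads usually requires the downstream reader to tolerate those changes (for example, via the Delta streaming option skipChangeCommits); otherwise the stream can fail when it encounters the delete commits.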

 
