08-29-2025 06:23 PM
I am getting thousands of records every second in my bronze table from Qlik, and every second the bronze table is truncated and reloaded with new data by Qlik itself. How do I process this much data every second into my silver streaming table, with a DLT pipeline, before the bronze table is truncated and reloaded? Does a DLT pipeline have enough power that, running in continuous mode, it can fetch this many records every second without losing any data? And my bronze table must be a truncate load; this cannot be changed.
08-30-2025 04:04 AM
Core Problem
Why This Breaks Streaming
Streaming queries in Databricks track offsets or appended files.
If Qlik truncates, then:
The data that was there is gone.
DLT sees the same table "start over" every second, which leads to lost micro-batches.
No checkpointing can recover truncated rows.
So in the current setup, you're effectively treating the bronze table like a volatile cache, not a durable streaming source.
Options to Solve This
1. Add a Durable Append Layer Before DLT
Instead of pointing DLT to the truncate-load bronze table, introduce an append-only ingestion layer.
Example:
Qlik → writes to staging (truncated every second).
A lightweight job (Auto Loader or Structured Streaming with foreachBatch) → copies new rows into an append-only Delta table (true bronze).
DLT (continuous) → reads from this append-only table safely.
This decouples Qlik's truncate pattern from your streaming system.
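A minimal sketch of that lightweight copy job, assuming the staging and append-only tables are plain Delta tables with hypothetical names; schedule it as a frequently-triggered job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Hypothetical table names; adjust to your catalog/schema.
STAGING = "main.ingest.qlik_staging"   # truncate-loaded by Qlik every second
BRONZE  = "main.ingest.bronze_append"  # durable, append-only "true bronze"

def copy_snapshot():
    # Read whatever Qlik loaded most recently and append it with a snapshot
    # timestamp, so downstream streaming only ever sees appended rows.
    (spark.table(STAGING)
          .withColumn("_snapshot_ts", F.current_timestamp())
          .write
          .mode("append")
          .saveAsTable(BRONZE))

copy_snapshot()  # run on a tight schedule, faster than Qlik's reload interval
```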
2. Snapshot Approach (Batch DLT) - I would recommend this
If you must keep truncate load, then treat each second's truncate-load as a full snapshot.
DLT can run in triggered batch mode every second (or every few seconds):
Compare the new snapshot with the last snapshot.
Compute delta changes (insert/update/delete).
Write results to silver.
Downside: not true "streaming," but avoids data loss.
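If you roll the comparison yourself instead of using a CDC helper, each trigger boils down to a MERGE of the latest full snapshot into silver. A rough sketch, assuming a key column id, hypothetical table names, and a Delta/runtime version that supports whenNotMatchedBySourceDelete:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

SNAPSHOT = "main.ingest.qlik_bronze"   # hypothetical: current full snapshot from Qlik
SILVER   = "main.core.silver_orders"   # hypothetical target table

snap   = spark.table(SNAPSHOT)
silver = DeltaTable.forName(spark, SILVER)

(silver.alias("t")
       .merge(snap.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()           # changed rows -> update
       .whenNotMatchedInsertAll()        # new rows -> insert
       .whenNotMatchedBySourceDelete()   # rows missing from the snapshot -> delete
       .execute())
```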
08-30-2025 10:04 PM
Thanks Manoj for your reply. Could you please explain the 2nd snapshot method in detail? What exactly do I need to do? Also, I have one more question: if my DLT streaming table always reads data from an append-only table, how do I keep the data in that source table from growing indefinitely? How do I set a retention policy on my source table?
3 weeks ago - last edited 3 weeks ago
@Anubhav2011, here is more information on the snapshot method: https://docs.databricks.com/aws/en/dlt/cdc#how-is-cdc-implemented-with-the-auto-cdc-from-snapshot-ap...
This process efficiently determines changes in source data by comparing a series of snapshots taken in order. It then executes the necessary processing for change data capture (CDC) of the records in those snapshots. This functionality is supported exclusively by the Lakeflow Declarative Pipelines Python interface.
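A minimal Python sketch of that snapshot-based flow, assuming a key column id and hypothetical table names; because the source is re-read as a full snapshot on every pipeline update, it tolerates Qlik's truncate-and-reload pattern:

```python
import dlt

# Streaming table that will hold the CDC result.
dlt.create_streaming_table("silver_target")

# Compare successive snapshots of the source and apply the resulting
# inserts/updates/deletes to the target (SCD type 1 shown here).
dlt.apply_changes_from_snapshot(
    target="silver_target",
    source="qlik_bronze",    # hypothetical: the table Qlik truncates and reloads
    keys=["id"],             # hypothetical business key
    stored_as_scd_type=1,
)
```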
By the end of the year, there are plans to incorporate TTL (time to live) functionality for Delta tables (both managed and DLT tables). The timeline may change, but in the meantime, you can set up a job with vacuum.
https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum
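Until the TTL feature lands, the retention job can be a small scheduled notebook task that trims old rows and then vacuums. A rough sketch with hypothetical names, assuming the append-only table carries a snapshot-timestamp column so old snapshots can be identified:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE = "main.ingest.bronze_append"  # hypothetical append-only source table

# Remove rows older than the retention window, compact, then clean up
# data files that are no longer referenced.
spark.sql(f"DELETE FROM {TABLE} WHERE _snapshot_ts < current_timestamp() - INTERVAL 7 DAYS")
spark.sql(f"OPTIMIZE {TABLE}")
spark.sql(f"VACUUM {TABLE}")  # default 7-day retention for removed files
```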
2 weeks ago - last edited 2 weeks ago
Initial question broken down
1. Truncate+insert bronze problem
Streaming/DLT continuous assumes append-only. With truncate+insert you'll always lose data (the engine sees a "reset," not an "append").
Fix: don't stream directly from bronze. Instead, capture each truncate-load as a full snapshot and append it into a new bronze_snapshots table. That preserves every second's data before bronze is wiped.
That's the bronze_snapshots DLT transform in the PoC.
2. Can DLT continuous handle it?
No, because the issue is not throughput but data semantics. Even if DLT is fast enough, it can't checkpoint against a source that resets.
Answer: run DLT in triggered batch mode (every second/few seconds). Each trigger captures the latest bronze snapshot and appends it. Then use APPLY CHANGES (or MERGE) to compute deltas into silver.
3. Append-only source growth / retention
Once you switch to bronze_snapshots (append-only), yes, it will grow forever.
Answer: control size with Delta retention settings, VACUUM, and OPTIMIZE.
1) Create an append-only snapshots table (DLT Python)
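A rough sketch of step 1, assuming the truncate-load is first captured into an external append-only Delta table by a small copy job (like the one sketched earlier in the thread), which the pipeline can then stream from safely; all names are hypothetical:

```python
import dlt

# Target streaming table inside the pipeline.
dlt.create_streaming_table("bronze_snapshots")

@dlt.append_flow(target="bronze_snapshots")
def ingest_snapshots():
    # An append-only Delta table is a valid streaming source, so this read
    # never has to cope with Qlik's truncate pattern.
    return spark.readStream.table("main.ingest.bronze_append")
```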
2) Apply changes from snapshots to silver (preferred: DLT apply_changes)
If your Databricks workspace supports DLT apply_changes() (APPLY CHANGES INTO), use it because it's declarative and handles ordering and deletes.
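A rough sketch of step 2 with apply_changes, reusing the key and snapshot-timestamp columns assumed in the sketches above (all names hypothetical). Note that deletes only flow through if the feed carries an explicit delete marker (apply_as_deletes); otherwise the snapshot-based CDC API is the safer choice:

```python
import dlt

dlt.create_streaming_table("silver")

dlt.apply_changes(
    target="silver",
    source="bronze_snapshots",    # the streaming table defined in step 1
    keys=["id"],                  # hypothetical business key
    sequence_by="_snapshot_ts",   # later snapshots win on conflicting keys
    stored_as_scd_type=1,
)
```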
2 weeks ago
The Apply Changes API is getting deprecated. The AUTO CDC APIs replace the APPLY CHANGES APIs and have the same syntax. The APPLY CHANGES APIs are still available, but Databricks recommends using the AUTO CDC APIs in their place.
Please refer to the latest documentation:
https://docs.databricks.com/aws/en/dlt/cdc
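Going by that documentation, the Python-side replacement appears to be dlt.create_auto_cdc_flow (and dlt.create_auto_cdc_from_snapshot_flow for the snapshot variant) with the same arguments as before; a minimal sketch under that assumption, with hypothetical names:

```python
import dlt

dlt.create_streaming_table("silver")

# Assumed new entry point; per the post above, the arguments match apply_changes.
dlt.create_auto_cdc_flow(
    target="silver",
    source="bronze_snapshots",
    keys=["id"],
    sequence_by="_snapshot_ts",
    stored_as_scd_type=1,
)
```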