
What is the Power of DLT Pipeline to read streaming data

Anubhav2011
New Contributor II

I am getting thousands of records every second in my bronze table from Qlik, and every second the bronze table is truncated and reloaded with new data by Qlik itself. How do I process this much data into my silver streaming table every second, before the bronze table is truncated again, using a DLT pipeline? Is a DLT pipeline powerful enough that, running in continuous mode, it can fetch this many records every second without losing any data? My bronze table has to be a truncate-and-load target, and this cannot be changed.

3 REPLIES

ManojkMohan
Valued Contributor III

Core Problem

  • Bronze table is not append-only, but truncate + insert every second.
  • DLT (Delta Live Tables) in continuous mode assumes append-only streaming sources (like Kafka).
  • Because Qlik wipes and replaces data every second, DLT cannot guarantee no data loss if you read bronze directly in streaming mode.

Why This Breaks Streaming
Streaming queries in Databricks track offsets or files appended to a source. If Qlik truncates the table, then:

  • The data that was there is gone.
  • DLT sees the same table "start over" every second → lost micro-batches.
  • No checkpointing can recover truncated rows.

So in the current setup, you're effectively treating the bronze table like a volatile cache, not a durable streaming source.

Options to Solve This
1. Add a Durable Append Layer Before DLT

Instead of pointing DLT to the truncate-load bronze table, introduce an append-only ingestion layer.

Example:

Qlik → writes to staging (truncate-load every second).

A lightweight job (Auto Loader or Structured Streaming with foreachBatch) → copies new rows into an append-only Delta table (the true bronze).

DLT (continuous) → reads from this append-only table safely.

This decouples Qlik's truncate pattern from your streaming system; a rough sketch of the copy step follows below.
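A minimal sketch of that copy step, simplified to a plain scheduled batch copy rather than Auto Loader or foreachBatch, and assuming hypothetical table names qlik_staging (the truncate-load target) and bronze_append (the durable append-only table), run in a Databricks notebook or job where spark is predefined:

```python
# Sketch: copy the current Qlik snapshot into an append-only bronze table.
# Assumed names: qlik_staging (truncate-load target), bronze_append (durable bronze).
from pyspark.sql import functions as F

# Read whatever snapshot Qlik has loaded right now.
snapshot = spark.read.table("qlik_staging")

# Stamp each row with the capture time so downstream consumers can tell snapshots apart.
snapshot = snapshot.withColumn("_ingested_at", F.current_timestamp())

# Append (never overwrite) into the table that DLT streams from.
snapshot.write.format("delta").mode("append").saveAsTable("bronze_append")
```

The DLT pipeline then streams safely from the append-only table (for example, a streaming table defined over spark.readStream.table("bronze_append")). How often this copy runs relative to Qlik's truncate cadence determines whether any snapshot can be missed.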

2. Snapshot Approach (Batch DLT) - I would recommend this

If you must keep the truncate load, then treat each second's truncate-load as a full snapshot.

DLT can run in triggered batch mode every second (or every few seconds):

Compare the new snapshot with the last snapshot.

Compute delta changes (insert/update/delete).

Write results to silver.

Downside: not true "streaming," but it avoids data loss. A rough sketch of the compare-and-merge step follows below.
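For illustration, here is a rough sketch of that compare-and-merge step done by hand with Delta MERGE, assuming hypothetical tables qlik_staging (the current snapshot), last_snapshot (a saved copy of the previous snapshot), and silver, all keyed on an assumed id column; the built-in snapshot CDC API mentioned in the next reply does essentially this for you:

```python
# Sketch: manual snapshot comparison with Delta MERGE (all names are placeholders).
# qlik_staging = current snapshot, last_snapshot = previous snapshot copy,
# silver = target table; "id" is an assumed primary-key column.
from delta.tables import DeltaTable

new_snap = spark.read.table("qlik_staging")

# Upsert the new snapshot into silver (covers inserts and updates).
(DeltaTable.forName(spark, "silver").alias("s")
    .merge(new_snap.alias("n"), "s.id = n.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Rows present in the previous snapshot but absent from the new one were deleted upstream.
old_snap = spark.read.table("last_snapshot")
deleted_keys = old_snap.select("id").subtract(new_snap.select("id"))
(DeltaTable.forName(spark, "silver").alias("s")
    .merge(deleted_keys.alias("d"), "s.id = d.id")
    .whenMatchedDelete()
    .execute())

# Save the new snapshot as the baseline for the next comparison.
new_snap.write.format("delta").mode("overwrite").saveAsTable("last_snapshot")
```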

Thanks Manoj for your reply. Could you please explain the second (snapshot) method in detail? What exactly do I need to do? I also have one more question: if my DLT streaming table always reads data from an append-only table, how do I keep the data in that source table from growing indefinitely? How do I set a retention policy on my source table?

@Anubhav2011, here is more information on the snapshot method: https://docs.databricks.com/aws/en/dlt/cdc#how-is-cdc-implemented-with-the-auto-cdc-from-snapshot-ap...

This process efficiently determines changes in source data by comparing a series of snapshots taken in order. It then executes the necessary processing for change data capture (CDC) of the records in those snapshots. This functionality is supported exclusively by the Lakeflow Declarative Pipelines Python interface.
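In the DLT / Lakeflow Declarative Pipelines Python interface this looks roughly like the following; the table names (qlik_staging, silver_customers) and the id key column are placeholders, and the linked page may show the newer AUTO CDC naming for the same capability:

```python
# Hedged sketch of snapshot-based CDC in a pipeline notebook.
# Placeholder names: qlik_staging (snapshotted source), silver_customers (target), "id" key.
import dlt

# Target streaming table that receives the CDC output.
dlt.create_streaming_table("silver_customers")

dlt.apply_changes_from_snapshot(
    target="silver_customers",
    source="qlik_staging",     # each pipeline update reads the current snapshot of this table
    keys=["id"],               # key used to classify rows as inserts, updates, or deletes
    stored_as_scd_type=1,      # keep only the latest version of each row
)
```

Each pipeline update (triggered or scheduled) reads the current snapshot of the source, diffs it against the previous one, and applies the resulting changes to the target.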

By the end of the year, there are plans to incorporate TTL (time to live) functionality for Delta tables (both managed and DLT tables). The timeline may change, but in the meantime, you can set up a job with vacuum.

https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum
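Until that TTL feature ships, one pattern is a small scheduled retention job against the append-only source table; the table name bronze_append, the _ingested_at column, and the 7-day window below are all assumptions:

```python
# Sketch: scheduled retention job for the append-only source table.
# Assumptions: bronze_append has an _ingested_at timestamp column; 7-day retention.
# DELETE removes old rows logically; VACUUM later reclaims the underlying files
# once they fall outside the Delta retention threshold.
spark.sql("""
    DELETE FROM bronze_append
    WHERE _ingested_at < current_timestamp() - INTERVAL 7 DAYS
""")

spark.sql("VACUUM bronze_append")
```

Note that deleting rows from a table a streaming query reads usually requires the downstream reader to tolerate those changes (for example, via the Delta streaming option skipChangeCommits); otherwise the stream can fail when it encounters the delete commits.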

 
