
Databricks Full Refresh of DLT Pipeline

NikosLoutas — Wed, 25 Jun 2025 10:21:52 GMT

Hello, I have a question regarding the full refresh of a DLT pipeline, where the data source is an external table.

When the pipeline runs without a full refresh, the stream pulls the data that is currently present in the external source table: if the table has N rows, those N rows are processed by the pipeline.

However, when running a full refresh, it seems that older data is reprocessed as well. That is, it pulls all data from the source table within the table's retention period, but ignores any DELETE statements that were run against the source table.

Is my understanding correct?

Thanks in advance

Re: Databricks Full Refresh of DLT Pipeline

paolajara — Wed, 25 Jun 2025 16:27:19 GMT

Hi, your understanding is accurate, with a few nuances about how DLT handles deletes and retention when it does a full refresh against an external source table.

During a full refresh: DLT clears the pipeline state and the output tables, then reprocesses all data from the source table as it currently exists.
Handling deletes in the source table: deletes in the source table are not tracked automatically unless you implement Change Data Capture (CDC).
- By default, DLT streaming tables operate in append-only mode and do not handle deletes or updates. Only new rows are processed.
If your source table has a retention policy but keeps historical versions, as Delta tables do, a full refresh pulls in all data present in the current state of that source (that is, whatever is visible at the time of the DLT refresh, t = now). Deleted rows that are still retained may appear through CDC if you use it, but not in ordinary append-only ingestion pipelines.
If retention purges deleted data (via VACUUM), the deleted records truly disappear and won’t be visible in the full refresh either.
Use CDC and/or row tracking to ensure that deletes and updates in the source table are tracked and correctly reflected downstream through DLT.
Retention of historical data in the source can cause more data to be reprocessed if those records persist in the source table at refresh time.
Full refresh reprocesses all currently available data, which means it can indeed reprocess historical data (including anything not deleted at the source) within retention limits.
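
To make the difference between an incremental update and a full refresh concrete, here is a minimal plain-Python sketch. It is only a toy model, not the DLT API: `ToySource`, `ToyPipeline`, and the checkpoint counter are hypothetical names, and it assumes the append-only behaviour described above (the incremental run sees only new appends, while the full refresh reloads the source as it exists now).

```python
from dataclasses import dataclass, field


@dataclass
class ToySource:
    """Hypothetical stand-in for a source table: current rows plus an append log."""
    rows: dict = field(default_factory=dict)     # key -> value, current snapshot
    appends: list = field(default_factory=list)  # every (key, value) ever appended

    def append(self, key, value):
        self.rows[key] = value
        self.appends.append((key, value))

    def delete(self, key):
        # A delete changes the current snapshot but adds nothing to the append log.
        self.rows.pop(key, None)


@dataclass
class ToyPipeline:
    """Hypothetical append-only target with a checkpoint; not the DLT API."""
    target: dict = field(default_factory=dict)
    checkpoint: int = 0  # number of appends already processed

    def update(self, source):
        # Incremental run: only appends after the checkpoint are processed.
        for key, value in source.appends[self.checkpoint:]:
            self.target[key] = value
        self.checkpoint = len(source.appends)

    def full_refresh(self, source):
        # Full refresh: drop state and output, then reload the current snapshot.
        self.target = dict(source.rows)
        self.checkpoint = len(source.appends)


source = ToySource()
pipeline = ToyPipeline()

source.append(1, "a")
source.append(2, "b")
pipeline.update(source)        # target now holds keys 1 and 2

source.delete(1)               # delete at the source
source.append(3, "c")
pipeline.update(source)        # append-only run: key 1 stays in the target
incremental_keys = sorted(pipeline.target)

pipeline.full_refresh(source)  # rebuilt from the source as it exists now
refreshed_keys = sorted(pipeline.target)
```

In this model the delete never reaches the target during incremental runs, which is why CDC is needed if deletes must be propagated without a full refresh.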

Re: Databricks Full Refresh of DLT Pipeline

seeyesbee — Wed, 25 Jun 2025 16:58:36 GMT

Hi @paolajara — in your point 5 you mentioned using Delta Lake for tracking changes. Could you point me to any official docs or examples that walk through enabling CDC / row-tracking on a Delta table?

I pull data from SharePoint via its REST endpoint, which gives me full snapshots but no explicit delete flags. Because hard-deletes aren’t surfaced directly, I need to compare each new snapshot with the previous one to spot rows that have disappeared and propagate those deletes downstream in Databricks. Do you have a recommended pattern (row-tracking + anti-join, CDF etc.) for implementing this? Any guidance or links would be much appreciated—thanks!
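
For reference, the snapshot comparison I have in mind looks roughly like the plain-Python sketch below. The `id` key, the sample rows, and the `diff_snapshots` helper are made up for illustration; on Databricks the "deleted" part would presumably be a `left_anti` join between the previous and the new snapshot DataFrames.

```python
def diff_snapshots(previous, current, key="id"):
    """Compare two full snapshots (lists of dicts) and classify rows by key."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    # Anti-join: keys in the previous snapshot with no match in the new one.
    deleted = [prev_by_key[k] for k in prev_by_key if k not in curr_by_key]
    # Keys that appear only in the new snapshot.
    inserted = [curr_by_key[k] for k in curr_by_key if k not in prev_by_key]
    # Keys present in both snapshots whose row content differs.
    updated = [
        curr_by_key[k]
        for k in curr_by_key
        if k in prev_by_key and curr_by_key[k] != prev_by_key[k]
    ]
    return {"deleted": deleted, "inserted": inserted, "updated": updated}


# Hypothetical snapshots from two consecutive pulls.
previous = [
    {"id": 1, "title": "Spec"},
    {"id": 2, "title": "Plan"},
    {"id": 3, "title": "Notes"},
]
current = [
    {"id": 1, "title": "Spec v2"},
    {"id": 3, "title": "Notes"},
    {"id": 4, "title": "Budget"},
]

changes = diff_snapshots(previous, current)
deleted_ids = [row["id"] for row in changes["deleted"]]
inserted_ids = [row["id"] for row in changes["inserted"]]
updated_ids = [row["id"] for row in changes["updated"]]
```

The rows in `changes["deleted"]` are the ones that would need to be turned into delete events downstream.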