
Databricks Full Refresh of DLT Pipeline

NikosLoutas
New Contributor III

Hello, I have a question regarding the full refresh of a DLT pipeline where the data source is an external table.

When running the pipeline without a full refresh, the stream pulls the data currently present in the external source table, meaning that if the table has N rows, those N rows are processed by the pipeline.

However, when running a full refresh, it seems that older data is reprocessed: the pipeline pulls all data from the source table within the source table's retention period, but ignores any DELETE statements run against it.

Is my understanding correct?

Thanks in advance
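
For concreteness, here is a minimal sketch of the pipeline shape being described, assuming the source is an external Delta table; the table and function names are illustrative:

```python
import dlt

# A streaming table that reads incrementally from an external Delta table.
# Without a full refresh, only rows appended to the source since the last
# update are processed; a full refresh resets this table and reprocesses
# whatever is currently visible in the source. (`spark` is provided by the
# DLT runtime.)
@dlt.table(name="bronze_events")
def bronze_events():
    return spark.readStream.table("external_catalog.schema.source_table")
```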

ACCEPTED SOLUTION

paolajara
Databricks Employee

Hi, your understanding is accurate, with a few nuances about how DLT handles full refreshes when the source is an external table, particularly around deletes and retention.

  1. During a full refresh, DLT clears the pipeline state and the output tables, then reprocesses all data from the source table as it currently exists.
  2. Handling deletes in the source table: deletes in the source are not automatically tracked unless you implement Change Data Capture (CDC).
    • By default, DLT streaming tables operate in append-only mode and do not handle deletes or updates; only new rows are processed.
  3. If your source table has a retention policy but retains historical versions, as Delta tables do, then a full refresh pulls in all data present in the current physical state of the source, i.e. whatever is visible at the moment the refresh runs. Deleted rows that are still retained may surface through CDC if you use it, but not in an ordinary append-only ingestion pipeline.
  4. If retention purges deleted data (via VACUUM), then deleted records truly disappear and won't be visible in a full refresh either.
  5. Use CDC and/or row tracking to ensure deletes and updates in the source table are tracked and correctly reflected downstream through DLT; see the sketches after this list.
  6. Retention of historical data in the source can cause more data to be reprocessed if those records persist in the source table at refresh time.
  7. A full refresh reprocesses all currently available data, which means it can indeed reprocess historical data (including anything not deleted at the source) within retention limits.
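
For point 5, a hedged sketch of enabling change data feed (CDF) and row tracking on the Delta source; the table name is illustrative, and row tracking requires a sufficiently recent Delta/DBR version:

```python
# Enable change data feed and row tracking on the source table. Both are
# documented Delta table properties; CDF surfaces inserts, updates, and
# deletes via a _change_type column.
spark.sql("""
    ALTER TABLE external_catalog.schema.source_table
    SET TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',
        'delta.enableRowTracking'    = 'true'
    )
""")

# Read the change feed incrementally; deleted rows arrive with
# _change_type = 'delete'.
changes = (
    spark.readStream
    .option("readChangeFeed", "true")
    .table("external_catalog.schema.source_table")
)
```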

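And a sketch of propagating those changes, deletes included, into a downstream streaming table with DLT's APPLY CHANGES API; the key, sequencing column, and names are assumptions for illustration:

```python
import dlt
from pyspark.sql.functions import expr

# Expose the source's change feed as a view, then apply it to a target
# streaming table so that source deletes are reflected downstream.
@dlt.view
def source_changes():
    return (
        spark.readStream
        .option("readChangeFeed", "true")
        .table("external_catalog.schema.source_table")
    )

dlt.create_streaming_table("silver_events")

dlt.apply_changes(
    target="silver_events",
    source="source_changes",
    keys=["id"],                        # assumed business key
    sequence_by="_commit_version",      # CDF ordering column
    apply_as_deletes=expr("_change_type = 'delete'"),
    except_column_list=["_change_type", "_commit_version", "_commit_timestamp"],
)
```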

2 REPLIES


seeyesbee
New Contributor II

Hi @paolajara — in your point 5 you mentioned using Delta Lake for tracking changes. Could you point me to any official docs or examples that walk through enabling CDC / row-tracking on a Delta table?

I pull data from SharePoint via its REST endpoint, which gives me full snapshots but no explicit delete flags. Because hard deletes aren't surfaced directly, I need to compare each new snapshot with the previous one to spot rows that have disappeared and propagate those deletes downstream in Databricks. Do you have a recommended pattern (row tracking + anti-join, CDF, etc.) for implementing this? Any guidance or links would be much appreciated. Thanks!
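
A minimal sketch of the snapshot-diff pattern described in this question, assuming two full snapshots registered as tables and a business key column item_id; all table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Previous and current full snapshots pulled from SharePoint.
prev = spark.table("staging.sharepoint_snapshot_prev")
curr = spark.table("staging.sharepoint_snapshot_curr")

# Rows present in the previous snapshot but missing from the current one
# were hard-deleted at the source: find them with a left anti-join.
deleted_keys = prev.join(curr, on="item_id", how="left_anti").select("item_id")

# Propagate the detected deletes into the downstream Delta table.
deleted_keys.createOrReplaceTempView("deleted_keys")
spark.sql("""
    MERGE INTO main.sharepoint_items AS t
    USING deleted_keys AS s
    ON t.item_id = s.item_id
    WHEN MATCHED THEN DELETE
""")
```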
