<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Full Refresh of DLT Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122861#M46887</link>
    <description>&lt;P&gt;Hi, your understanding is accurate, with a few nuances about how DLT handles full refreshes of external source tables with respect to deletes and retention.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;During a full refresh: DLT clears the pipeline state and output tables, then reprocesses all data from the source table as it currently exists.&lt;/LI&gt;
&lt;LI&gt;Handling deletes in the source table: deletes are not tracked automatically unless you implement Change Data Capture (CDC).
&lt;UL&gt;
&lt;LI&gt;By default, DLT streaming live tables operate in append-only mode and do not handle deletes or updates. Only new rows are processed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;If your source table has a retention policy but retains historical versions—such as with Delta tables—then a full refresh pulls in all data present in the current physical state of that source (meaning, whatever is visible at the time of the DLT refresh t = now). Deletes that are still present because of retention may appear in CDC if you use it, but not in ordinary append-only ingestion pipelines&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;If retention purges deleted data (via VACUUM), then deleted records truly disappear and won’t be visible in the full refresh either.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;Use CDC and/or row tracking to ensure deletes and updates in the source table are tracked and correctly reflected downstream through DLT.&lt;/LI&gt;
&lt;LI&gt;Retention of historical data in the source can cause more data to be reprocessed if those records persist in the source table at refresh time.&lt;/LI&gt;
&lt;LI&gt;Full refresh reprocesses all currently available data, which means it can indeed reprocess historical data (including anything not deleted at the source) within retention limits.&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Wed, 25 Jun 2025 16:27:19 GMT</pubDate>
    <dc:creator>paolajara</dc:creator>
    <dc:date>2025-06-25T16:27:19Z</dc:date>
    <item>
      <title>Databricks Full Refresh of DLT Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122794#M46869</link>
      <description>&lt;P&gt;Hello, I have a question regarding the full refresh of a DLT pipeline where the data source is an external table.&lt;BR /&gt;&lt;BR /&gt;When running the pipeline &lt;STRONG&gt;without a full refresh&lt;/STRONG&gt;, &lt;STRONG&gt;the stream pulls the data that is currently present&lt;/STRONG&gt; in the external source table, meaning that if the table has N rows, those N rows are processed by the pipeline.&lt;BR /&gt;&lt;BR /&gt;However, when &lt;STRONG&gt;running a full refresh&lt;/STRONG&gt;, it seems that older data is reprocessed. That is, &lt;STRONG&gt;it pulls all data from the source table, respecting the retention period of the source table but ignoring any DELETE statements on the source table.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;Is my understanding correct?&lt;BR /&gt;&lt;BR /&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 10:21:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122794#M46869</guid>
      <dc:creator>NikosLoutas</dc:creator>
      <dc:date>2025-06-25T10:21:52Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Full Refresh of DLT Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122861#M46887</link>
      <description>&lt;P&gt;Hi, your understanding is accurate, with a few nuances about how DLT handles full refreshes of external source tables with respect to deletes and retention.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;During a full refresh: DLT clears the pipeline state and output tables, then reprocesses all data from the source table as it currently exists.&lt;/LI&gt;
&lt;LI&gt;Handling deletes in the source table: deletes are not tracked automatically unless you implement Change Data Capture (CDC).
&lt;UL&gt;
&lt;LI&gt;By default, DLT streaming live tables operate in append-only mode and do not handle deletes or updates. Only new rows are processed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;If your source table has a retention policy but retains historical versions—such as with Delta tables—then a full refresh pulls in all data present in the current physical state of that source (meaning, whatever is visible at the time of the DLT refresh t = now). Deletes that are still present because of retention may appear in CDC if you use it, but not in ordinary append-only ingestion pipelines&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;If retention purges deleted data (via VACUUM), then deleted records truly disappear and won’t be visible in the full refresh either.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;Use CDC and/or row tracking to ensure deletes and updates in the source table are tracked and correctly reflected downstream through DLT.&lt;/LI&gt;
&lt;LI&gt;Retention of historical data in the source can cause more data to be reprocessed if those records persist in the source table at refresh time.&lt;/LI&gt;
&lt;LI&gt;Full refresh reprocesses all currently available data, which means it can indeed reprocess historical data (including anything not deleted at the source) within retention limits.&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 25 Jun 2025 16:27:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122861#M46887</guid>
      <dc:creator>paolajara</dc:creator>
      <dc:date>2025-06-25T16:27:19Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Full Refresh of DLT Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122865#M46889</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170268"&gt;@paolajara&lt;/a&gt; — in your point 5 you mentioned using Delta Lake for tracking changes. Could you point me to any official docs or examples that walk through enabling CDC / row-tracking on a Delta table?&lt;/P&gt;&lt;P&gt;I pull data from SharePoint via its REST endpoint, which gives me full snapshots but no explicit delete flags. Because hard-deletes aren’t surfaced directly, I need to compare each new snapshot with the previous one to spot rows that have disappeared and propagate those deletes downstream in Databricks. Do you have a recommended pattern (row-tracking + anti-join, CDF etc.) for implementing this? Any guidance or links would be much appreciated—thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 16:58:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-full-refresh-of-dlt-pipeline/m-p/122865#M46889</guid>
      <dc:creator>seeyesbee</dc:creator>
      <dc:date>2025-06-25T16:58:36Z</dc:date>
    </item>
  </channel>
</rss>

