<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Auto Loader duplicate tracking in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/auto-loader-duplicate-tracking/m-p/161152#M54992</link>
    <description>&lt;P&gt;Hi experts, I read an article about how Auto Loader handles duplicates, and it left me a bit confused. As I understand it, the checkpoint tracks what has been processed, so that later runs process only the new incoming records. But suppose I reload a whole bulk of records that includes previously processed ones. Would Auto Loader check the transaction file and discard the records that have already been processed, or would it silently process them again? Thank you.&lt;/P&gt;</description>
    <pubDate>Thu, 02 Jul 2026 06:59:15 GMT</pubDate>
    <dc:creator>Sam500</dc:creator>
    <dc:date>2026-07-02T06:59:15Z</dc:date>
    <item>
      <title>Auto Loader duplicate tracking</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-duplicate-tracking/m-p/161152#M54992</link>
      <description>&lt;P&gt;Hi experts, I read an article about how Auto Loader handles duplicates, and it left me a bit confused. As I understand it, the checkpoint tracks what has been processed, so that later runs process only the new incoming records. But suppose I reload a whole bulk of records that includes previously processed ones. Would Auto Loader check the transaction file and discard the records that have already been processed, or would it silently process them again? Thank you.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jul 2026 06:59:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-duplicate-tracking/m-p/161152#M54992</guid>
      <dc:creator>Sam500</dc:creator>
      <dc:date>2026-07-02T06:59:15Z</dc:date>
    </item>
    <item>
      <title>Re: Auto Loader duplicate tracking</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-duplicate-tracking/m-p/161186#M54999</link>
      <description>&lt;P&gt;Hi, Auto Loader's tracking is at the file level, not the record level, and that distinction is exactly what's tripping you up here.&lt;/P&gt;
&lt;P&gt;The checkpoint keeps a RocksDB-backed record of every file it has already discovered and ingested, keyed by file attributes such as path and modification time. So if your "bulk reload" literally points Auto Loader back at the same files it already saw, and those files haven't changed, it recognizes them from checkpoint state and skips them: no reprocessing, no duplicate rows.&lt;/P&gt;
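&lt;P&gt;If it helps, here is a rough mental model of that file-level bookkeeping in plain Python. It is only a toy sketch, not how Auto Loader is actually implemented: the &lt;CODE&gt;FileCheckpoint&lt;/CODE&gt; class, its &lt;CODE&gt;new_files&lt;/CODE&gt; method and the file paths are all made up for illustration, and an in-memory dict stands in for the RocksDB state.&lt;/P&gt;
&lt;PRE&gt;
```python
# Toy model of file-level tracking. NOT Auto Loader's real implementation:
# an in-memory dict stands in for the RocksDB-backed checkpoint state.


class FileCheckpoint:
    """Remembers which files were already ingested (hypothetical helper)."""

    def __init__(self):
        # Maps file path to the modification time recorded at ingestion.
        self.seen = {}

    def new_files(self, listing):
        """Return paths in `listing` that were not ingested before, and record them."""
        fresh = []
        for path, mtime in listing:
            # Assumption: a path that is already recorded is skipped.
            # In-place modifications are deliberately left out of this toy.
            if path not in self.seen:
                self.seen[path] = mtime
                fresh.append(path)
        return fresh


# Hypothetical directory listing: (path, modification time) pairs.
listing = [("landing/a.json", 100), ("landing/b.json", 100)]

checkpoint = FileCheckpoint()
first_run = checkpoint.new_files(listing)   # both files are new
reload_run = checkpoint.new_files(listing)  # same unchanged files: nothing to do

print(first_run)
print(reload_run)
```
&lt;/PRE&gt;
&lt;P&gt;The second call returns an empty list, which is the "skip" behavior described above: the decision is made per file, and the rows inside the files are never inspected.&lt;/P&gt;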
&lt;P&gt;Where it falls apart is when the same records show up in a file Auto Loader hasn't seen before: a new filename, a file that was deleted and re-landed under a different path, or the same content re-exported into a differently named batch. Auto Loader has no idea those rows already exist downstream. It just sees a new file and ingests it, so you get duplicates in your table. The same thing happens if you turn on &lt;CODE&gt;cloudFiles.allowOverwrites&lt;/CODE&gt; and a file gets modified in place: Auto Loader reprocesses the whole file rather than diffing it, which also produces duplicate records unless you handle them yourself.&lt;/P&gt;
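&lt;P&gt;To make that failure mode concrete, here is a small plain-Python sketch (no Spark or Delta involved). The file names, the &lt;CODE&gt;order_id&lt;/CODE&gt; key and both helper functions are hypothetical. It compares a plain append with a write keyed on a business key when the same rows arrive in two differently named files.&lt;/P&gt;
&lt;PRE&gt;
```python
# Hypothetical input: the same two orders landed twice, in differently
# named files, so a file-level tracker treats both files as new.
files = {
    "landing/orders_001.json": [
        {"order_id": 1, "amount": 10},
        {"order_id": 2, "amount": 20},
    ],
    "landing/orders_001_reexport.json": [
        {"order_id": 1, "amount": 10},
        {"order_id": 2, "amount": 20},
    ],
}


def append_all(files):
    """Plain append: every row of every ingested file lands in the table."""
    table = []
    for rows in files.values():
        table.extend(rows)
    return table


def upsert_by_key(files, key):
    """Keyed write: a row replaces any existing row that has the same key."""
    table = {}
    for rows in files.values():
        for row in rows:
            table[row[key]] = row
    return list(table.values())


appended = append_all(files)
upserted = upsert_by_key(files, "order_id")

print(len(appended), len(upserted))  # prints: 4 2
```
&lt;/PRE&gt;
&lt;P&gt;The append ends up with four rows for two orders, while the keyed write stays at one row per &lt;CODE&gt;order_id&lt;/CODE&gt; no matter how many times the same data is re-landed.&lt;/P&gt;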
&lt;P&gt;Bottom line: Auto Loader guarantees exactly-once processing per file, not exactly-once per record. If duplicate records across files or reloads are a real risk in your pipeline, you need dedup logic on top: either &lt;CODE&gt;dropDuplicates&lt;/CODE&gt; on a key in the stream, or a &lt;CODE&gt;MERGE INTO&lt;/CODE&gt; keyed on a natural/business key when writing to your Delta table.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jul 2026 09:43:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-duplicate-tracking/m-p/161186#M54999</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2026-07-02T09:43:32Z</dc:date>
    </item>
  </channel>
</rss>

