Auto Loader duplicate tracking

Sam500 — Thu, 02 Jul 2026 06:59:15 GMT

Hi experts, I read an article about how Auto Loader handles duplicates, and it left me a bit confused. As I understand it, the checkpoint tracks what has already been processed, so Auto Loader only processes new incoming records. But let's say I reload a whole bulk of records that includes previously processed ones: would Auto Loader check the transaction file to make sure the already-processed records are discarded, or would it silently process them again? Thank you.

Re: Auto Loader duplicate tracking

iyashk-DB — Thu, 02 Jul 2026 09:43:32 GMT

Hi, Auto Loader's tracking is at the file level, not the record level, and that distinction is exactly what's tripping you up here.

The checkpoint keeps a RocksDB-backed record of every file it has already discovered and ingested, keyed by things like path and modification time. So if your "bulk reload" is literally pointing Auto Loader back at the same files it already saw and those files haven't changed, it'll recognize them from checkpoint state and skip them: no reprocessing, no duplicate rows.

Where it falls apart is if the same records show up in a file Auto Loader hasn't seen before: a new filename, a file that got deleted and re-landed, or the same content re-exported into a differently named batch. Auto Loader has no idea those rows already exist downstream. It just sees a new file and ingests it, so you'll get duplicates in your table. The same thing happens if you turn on cloudFiles.allowOverwrites and a file gets modified in place: Auto Loader reprocesses the whole file rather than diffing it, which also produces duplicate records unless you handle it yourself.
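To make the file-versus-record distinction concrete, here is a toy model in plain Python. It is not Auto Loader's actual implementation, just a sketch of the behaviour described above; the class name `ToyFileTracker`, the file paths, and the `order_id` rows are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileTracker:
    """Toy stand-in for checkpoint state: it remembers files, never records."""
    allow_overwrites: bool = False
    seen: set = field(default_factory=set)
    table: list = field(default_factory=list)  # rows "ingested" so far

    def ingest(self, path: str, mtime: int, rows: list) -> bool:
        # Assumption for this sketch: identity is the path alone, or path plus
        # modification time when overwrites are allowed. Row contents are
        # never inspected.
        key = (path, mtime) if self.allow_overwrites else (path,)
        if key in self.seen:
            return False  # file already processed, so it is skipped
        self.seen.add(key)
        self.table.extend(rows)
        return True


tracker = ToyFileTracker()
rows = [{"order_id": 1}, {"order_id": 2}]

first = tracker.ingest("landing/batch_001.json", mtime=100, rows=rows)
again = tracker.ingest("landing/batch_001.json", mtime=100, rows=rows)
renamed = tracker.ingest("landing/batch_001_reload.json", mtime=200, rows=rows)

print(first, again, renamed)  # True False True
print(len(tracker.table))     # 4
```

The second call is skipped because the file is already known, but the renamed file carries the same two rows straight into the table, which is the duplicate scenario from the question.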

Bottom line: Auto Loader guarantees exactly-once processing per file, not exactly-once per record. If duplicate records across files or reloads are a real risk in your pipeline, you need dedup logic on top, either dropDuplicates in the stream or a MERGE INTO keyed on a natural/business key when writing to your Delta table.
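A rough sketch of the MERGE approach with `foreachBatch` might look like the following. Everything named here is an assumption for illustration: the paths, the table name `main.sales.orders`, the `order_id` key, and the JSON source format. It also assumes the target Delta table already exists and that `delta-spark` is available, as it is on Databricks.

```python
from typing import Sequence

# Hypothetical names for illustration only; adjust to your environment.
SOURCE_PATH = "/mnt/landing/orders"          # assumed landing folder
CHECKPOINT_PATH = "/mnt/checkpoints/orders"  # assumed checkpoint location
TARGET_TABLE = "main.sales.orders"           # assumed existing Delta table
KEY_COLUMNS = ["order_id"]                   # assumed business key


def merge_condition(keys: Sequence[str]) -> str:
    """Build the ON clause matching target (t) and source (s) rows by key."""
    return " AND ".join(f"t.{k} = s.{k}" for k in keys)


def upsert_batch(batch_df, batch_id: int) -> None:
    """foreachBatch handler: dedupe inside the micro-batch, then MERGE."""
    from delta.tables import DeltaTable  # lazy import; needs delta-spark

    # Collapse duplicates that arrive together in the same micro-batch.
    deduped = batch_df.dropDuplicates(KEY_COLUMNS)
    target = DeltaTable.forName(batch_df.sparkSession, TARGET_TABLE)
    (
        target.alias("t")
        .merge(deduped.alias("s"), merge_condition(KEY_COLUMNS))
        .whenNotMatchedInsertAll()  # keys already in the table are left alone
        .execute()
    )


def start_stream(spark):
    """Start an Auto Loader stream that writes through upsert_batch."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
        .load(SOURCE_PATH)
        .writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .trigger(availableNow=True)
        .start()
    )
```

With this shape, a re-exported file still gets ingested by Auto Loader, but rows whose key already exists in the target are not inserted again. Note that `dropDuplicates` keeps an arbitrary row per key, so if you need "latest wins" you would want to order by a timestamp column before picking one.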
