szymon_dybczak
Esteemed Contributor III

Hi @8b1tz ,

Once again, if you use auto loader it guarantees exactly once semantics, so there shouldn't be any duplicates.
The same applies if you were to use Event Hub, it's just different data source, but same concept of structured streaming applies (auto loader is built upon structered streaming).

Below is a snippet from documentation:

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.

In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.

 

I highly recommend you to get to know how auto loader work (or more generally, how structured streaming works). Read documentation, watch some video on YT.