10-25-2024 10:56 PM
Hi,
If I use dropDuplicates inside foreachBatch, dropDuplicates becomes stateless and keeps no state. It just drops duplicates within the current micro-batch, so I don't have to specify a watermark. Is this true?
Thanks
Labels: Delta Lake, Spark
Accepted Solutions
10-26-2024 12:22 AM
Hi @Brad ,
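Yes, you are correct. foreachBatch operates on a single micro-batch at a time, so dropDuplicates inside it only compares rows within that batch and keeps no state. You can use it there without a watermark and without worrying about state management.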
10-27-2024 10:30 AM
Yes, you're correct! When using dropDuplicates within foreachBatch, it operates only on the current micro-batch, so it removes duplicates in a stateless manner for each batch independently. Since there's no continuous state tracking across batches, you don't need to specify a watermark.
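As a rough illustration of the per-batch behavior, here is a minimal sketch. The key column `event_id`, the target table `events_clean`, and the function names are hypothetical placeholders, not anything from the original question, and it assumes the target is a Delta table. Inside the handler, `batch_df` is an ordinary (non-streaming) DataFrame, so `dropDuplicates` runs as a plain batch operation.

```python
# Hypothetical names for illustration only; replace with your own schema/tables.
KEY_COLUMNS = ["event_id"]
TARGET_TABLE = "events_clean"


def dedupe_batch(batch_df, batch_id):
    """foreachBatch handler: batch_df is a static DataFrame for one micro-batch."""
    # Stateless: only rows inside this micro-batch are compared with each other.
    deduped = batch_df.dropDuplicates(KEY_COLUMNS)
    deduped.write.format("delta").mode("append").saveAsTable(TARGET_TABLE)


def start_query(spark, source_table, checkpoint_path):
    """Wire the handler into a streaming query (spark is an existing SparkSession)."""
    return (
        spark.readStream.table(source_table)
        .writeStream.foreachBatch(dedupe_batch)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Note that a key appearing in two different micro-batches would still be written twice with this handler, since nothing is remembered between calls.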
If you need to track duplicates across batches, you'd have to handle state management explicitly or use a different approach.
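One possible "different approach" is to deduplicate on the streaming DataFrame itself, before `writeStream`, which is where the watermark comes back into play. The sketch below assumes hypothetical column names `event_id` and `event_time` and an arbitrary 10-minute lateness bound; none of these come from the thread.

```python
def dedupe_across_batches(stream_df, key_column="event_id",
                          time_column="event_time", delay="10 minutes"):
    """Stateful dedup on a streaming DataFrame (apply before writeStream)."""
    return (
        stream_df
        # The watermark bounds how long Spark keeps previously seen keys in state.
        .withWatermark(time_column, delay)
        # Including the event-time column in the key lets Spark expire old state.
        .dropDuplicates([key_column, time_column])
    )
```

Here Spark keeps state across micro-batches, so the trade-off is state store size versus how late a duplicate may arrive and still be caught.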

