12-15-2024 03:41 PM
Hi @ashraf1395,
Just a few comments about your question:
The cloudFiles source in Databricks is designed for incremental file processing. However, it depends on the checkpoint directory to track which files have been processed.
The cloudFiles.includeExistingFiles option determines whether files already present in the input path are included in stream processing, or whether only files arriving after the initial setup are processed. This option is evaluated only when you start a stream for the first time.
With cloudFiles.includeExistingFiles set to false, files already present in the directory are not processed during the first run of the pipeline. It doesn’t stop those files from being reprocessed if the checkpoint directory is reset.
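To make the roles of the option and the checkpoint concrete, here is a minimal sketch of how such a stream might be wired up. The helper names (autoloader_options, start_append_stream), the JSON file format, and all paths and table names are assumptions for illustration, not taken from your pipeline; only the cloudFiles.* option keys and checkpointLocation are actual Auto Loader / Structured Streaming settings.

```python
def autoloader_options(schema_location, include_existing=True, file_format="json"):
    """Build Auto Loader reader options (hypothetical helper).

    include_existing maps to cloudFiles.includeExistingFiles, which is
    only evaluated the first time the stream starts.
    """
    return {
        "cloudFiles.format": file_format,  # assumed format, adjust as needed
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.includeExistingFiles": "true" if include_existing else "false",
    }


def start_append_stream(spark, input_path, target_table, checkpoint_path, options):
    """Start an append-mode stream from input_path into target_table.

    The checkpoint at checkpoint_path is what records which files were
    already ingested, so it should stay stable across restarts.
    """
    df = spark.readStream.format("cloudFiles").options(**options).load(input_path)
    return (
        df.writeStream
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
```

As long as the same checkpoint path is reused between runs, the stream resumes from its recorded state; pointing it at a new or emptied checkpoint makes the next start count as a first run.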
Since you’re using outputMode("append"), every processed record is appended to the target table. Without deduplication, duplicates from previously processed files will accumulate.
To avoid duplicates, you might want to implement deduplication when writing to the target table.
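One possible way to do this, assuming the target is a Delta table and your records have a stable key, is to replace the plain append with a foreachBatch writer that runs an insert-only MERGE. The sketch below is illustrative only: the helper names, the temp view name, and the key columns are assumptions, and DataFrame.sparkSession requires a reasonably recent runtime (Spark 3.3 or later).

```python
def build_insert_only_merge(target, source_view, key_cols):
    """Return a MERGE statement that inserts only rows whose key is new."""
    condition = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON {condition} "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def make_batch_writer(target, key_cols, view_name="updates"):
    """Create a foreachBatch function (hypothetical helper).

    Each micro-batch is deduplicated on key_cols, then merged into the
    target so rows that already exist there are skipped.
    """
    merge_sql = build_insert_only_merge(target, view_name, key_cols)

    def write_batch(batch_df, batch_id):
        # Drop duplicates inside the batch before comparing with the target.
        batch_df.dropDuplicates(key_cols).createOrReplaceTempView(view_name)
        # Run the MERGE on the session that owns this micro-batch.
        batch_df.sparkSession.sql(merge_sql)

    return write_batch
```

It would be attached with something like df.writeStream.foreachBatch(make_batch_writer("my_table", ["id"])).option("checkpointLocation", ...).start(), where my_table and id are placeholders. With this in place, reprocessing a file after a checkpoint reset should not add rows whose keys are already in the table.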