03-15-2023 02:52 AM
I'm using Auto Loader in directory listing mode (without incremental file listing), and sometimes new files are not picked up and do not appear in the cloud_files-listing.
I have found that setting the 'cloudFiles.backfillInterval' option resolves the detection of these files, so it seems to me that this is an effect of the file notification system not giving a 100% guarantee.
Now I am wondering what the option 'cloudfiles.backfillInterval' will actually do as I find the documentation ambiguous.
Will 'cloudFiles.backfillInterval':
PS: When looking at the cloud_files-listing I do not get any discovery_times, I suppose these are only relevant in file notification mode?
03-15-2023 05:52 AM
Hi @Fabrice Deseyn, the backfillInterval option is to make sure that eventually all the files get processed. The backfill does not work on the new files. All the new files are processed as per your configuration of the directory listing or the file notification mode. Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using backfillInterval, you can control how the old files will be processed.
I would also suggest using either file notification mode or incremental listing for better performance.
03-15-2023 06:06 AM
Hi @Lakshay Goel,
So to make sure I correctly understood your answer (see snippet below):
Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using backFillinterval, you can control how the old files will be processed.
only the old files that have not been processed will be processed?
03-15-2023 06:09 AM
Yes, that is correct
10-23-2023 03:02 AM
Hi @Lakshay Goel,
Where can I set the backfillInterval property in the code? Do you have any sample code for this use case?
11-27-2023 05:36 AM
You do it when you read the files as .option("cloudFiles.backfillInterval", "1 week")
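To put that in context, here is a minimal sketch of a full Auto Loader read with the option set. The paths, the JSON source format, and the helper names `autoloader_options` and `read_with_backfill` are assumptions for illustration only; `spark` is the session that a Databricks notebook already provides.

```python
# Hypothetical locations; replace with your own paths.
SOURCE_PATH = "/mnt/landing/events"
SCHEMA_PATH = "/mnt/checkpoints/events/_schema"


def autoloader_options(backfill_interval="1 week"):
    """Collect the Auto Loader options in one place (illustrative helper)."""
    return {
        # Source file format; "json" is an assumption for this example.
        "cloudFiles.format": "json",
        # Where Auto Loader stores the inferred schema.
        "cloudFiles.schemaLocation": SCHEMA_PATH,
        # The option discussed in this thread, passed as an interval string.
        "cloudFiles.backfillInterval": backfill_interval,
    }


def read_with_backfill(spark, path=SOURCE_PATH, backfill_interval="1 week"):
    """Return a streaming DataFrame that reads `path` with Auto Loader."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(backfill_interval))
        .load(path)
    )
```

In a notebook you would call `df = read_with_backfill(spark)` and then attach your usual `writeStream` with a checkpoint location. Keeping the options in a dict just makes it easy to change the interval per job; chaining `.option(...)` calls directly works the same way.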
08-30-2024 08:19 AM
If we set the backfill to 1 week, will it run only once a week, or will it look for old unprocessed files on every trigger?
For example, if we set it to 1 day and the job runs every hour, will it look for files from the past 24 hours on a sliding basis over time? Or will it just ensure that once every 24 hours it runs a full scan for unprocessed files and processes them?