
What does autoloader's cloudfiles.backfillInterval do?

FabriceDeseyn
Contributor

I'm using Auto Loader in directory listing mode (without incremental file listing), and sometimes new files are neither picked up nor found in the cloud_files-listing.

I have found that setting the 'cloudFiles.backfillInterval' option resolves the detection of these files, and it therefore seems to me that this is an effect of the file notification system not giving a 100% guarantee.

Now I am wondering what the option 'cloudFiles.backfillInterval' actually does, as I find the documentation ambiguous.

Will 'cloudFiles.backfillInterval':

  • Recompute the cloud_files-listing every interval and only process the new files
  • or recompute the cloud_files-listing every interval and process all files?

PS: When looking at the cloud_files-listing I do not get any discovery_times, I suppose these are only relevant in file notification mode?


1 ACCEPTED SOLUTION


Lakshay
Esteemed Contributor

Hi @Fabrice Deseyn, the cloudFiles.backfillInterval option makes sure that all files eventually get processed. The backfill does not act on new files: new files are processed according to your configuration of directory listing or file notification mode. Since there is no 100% guarantee that every file will be picked up that way, the backfill process runs asynchronously to pick up any older files that have not been processed. With backfillInterval, you control how often this backfill runs.

I would also suggest using either file notification mode or incremental listing for better performance.


Hi @Lakshay Goel​ ,

So to make sure I correctly understood your answer (see snippet below):

"Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using backfillInterval, you can control how the old files will be processed."

only the old files that have not been processed will be processed?

Lakshay
Esteemed Contributor

Yes, that is correct

Kiranrathod
New Contributor III

Hi @Lakshay Goel​ ,

Where can I set the backfillInterval property in the code? Do you have any sample code for this use case?

g96g
New Contributor III

You do it when you read the files as .option("cloudFiles.backfillInterval", "1 week")
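
For a slightly fuller picture, here is a minimal PySpark sketch of where that option could sit in an Auto Loader read. The file format, the schema location, the source path and the helper name build_stream are all hypothetical placeholders, not taken from this thread; only the option name and the "1 week" value come from the reply above.

```python
# Auto Loader options. Everything except cloudFiles.backfillInterval is an
# assumed placeholder; adjust the format and schema location to your setup.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",                          # assumed input format
    "cloudFiles.schemaLocation": "/tmp/example/_schema",  # hypothetical path
    "cloudFiles.backfillInterval": "1 week",              # value from the reply above
}


def build_stream(spark, source_path, options=AUTOLOADER_OPTIONS):
    """Return a streaming DataFrame that reads source_path with Auto Loader.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    return (
        spark.readStream
        .format("cloudFiles")   # Auto Loader source
        .options(**options)     # includes the backfill interval
        .load(source_path)
    )
```

In a notebook this would be called as build_stream(spark, "/mnt/landing/events"), where the path is again only an example. Keeping the options in one dictionary makes it easy to see at a glance whether a backfill interval is set.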

 

 
 