
What does Auto Loader's cloudFiles.backfillInterval do?

FabriceDeseyn
Contributor

I'm using Auto Loader in directory listing mode (without incremental file listing), and sometimes new files are not picked up or found in the cloud_files-listing.

I have found that setting the 'cloudFiles.backfillInterval' option can resolve the detection of these files, and therefore it seems to me that this is an effect of the file notification system not giving a 100% guarantee.

Now I am wondering what the option 'cloudFiles.backfillInterval' actually does, as I find the documentation ambiguous.

Will 'cloudFiles.backfillInterval':

  • Recompute the cloud_files-listing every interval and only process the new files
  • or recompute the cloud_files-listing every interval and process all files?

PS: When looking at the cloud_files-listing I do not get any discovery_times. I suppose these are only relevant in file notification mode?


1 ACCEPTED SOLUTION


Lakshay
Esteemed Contributor

Hi @Fabrice Deseyn, the cloudFiles.backfillInterval option is there to make sure that all files eventually get processed. The backfill does not work on new files. All new files are processed according to the mode you configured, either directory listing or file notification. Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using cloudFiles.backfillInterval, you can control how the old files will be processed.

I would also suggest using either file notification mode or incremental listing for better performance.


6 REPLIES


Hi @Lakshay Goel,

So to make sure I correctly understood your answer (see snippet below):

"Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using cloudFiles.backfillInterval, you can control how the old files will be processed."

So only the old files that have not yet been processed will be processed?

Lakshay
Esteemed Contributor

Yes, that is correct
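
To make the confirmed behaviour concrete, here is a toy model of it in plain Python. It is only an illustration of the idea "backfill picks up old files that were not yet processed" and is not how Auto Loader is implemented; the function name and the file paths are hypothetical.

```python
def files_to_backfill(full_listing, already_processed):
    """Return the files a backfill would still need to process.

    Toy model only: a set difference between a full listing and the
    files already recorded as processed.
    """
    return sorted(set(full_listing) - set(already_processed))


# Hypothetical file names, for illustration only.
listing = ["/landing/a.json", "/landing/b.json", "/landing/c.json"]
processed = ["/landing/a.json", "/landing/c.json"]

# Only the file that was missed earlier is returned; the others are skipped.
missed = files_to_backfill(listing, processed)
print(missed)
```

In this model, files that were already processed are never handed to the stream again, which matches the "only the unprocessed old files" reading confirmed above.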

Kiranrathod
New Contributor III

Hi @Lakshay Goel,

Where can I set the cloudFiles.backfillInterval property in the code? Do you have any sample code for this use case?

g96g
New Contributor III

You do it when you read the files as .option("cloudFiles.backfillInterval", "1 week")
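
A fuller sketch of where that option sits in a PySpark Auto Loader read might look like the following. It assumes a Databricks runtime where the `cloudFiles` source is available and `spark` is an active session; the helper names, the paths, and the choice of JSON as the format are hypothetical and only there for illustration.

```python
def autoloader_options(file_format, schema_location, backfill_interval="1 week"):
    """Collect Auto Loader reader options, including the backfill interval."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Ask Auto Loader to run an asynchronous backfill at this interval.
        "cloudFiles.backfillInterval": backfill_interval,
    }


def read_with_backfill(spark, source_path, options):
    """Build the streaming DataFrame; `spark` is an active SparkSession on Databricks."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


# Hypothetical paths, for illustration only.
opts = autoloader_options("json", "/tmp/example/_schemas", backfill_interval="1 day")
print(opts)

# On a Databricks cluster you would then call:
# df = read_with_backfill(spark, "/tmp/example/landing", opts)
```

Keeping the options in a dictionary makes it easy to change the interval per job without touching the reader code; chaining `.option("cloudFiles.backfillInterval", "1 day")` directly on the reader is equivalent.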

 

 
 

822025
New Contributor II

If we set the backfill to 1 week, will it run only once a week, or will it look for old unprocessed files on every trigger?

For example, if we set it to 1 day and the job runs every hour, will it look for files from the past 24 hours on a sliding basis over time? Or will it just ensure that once every 24 hours it runs a full scan for unprocessed files and processes them?
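
This question is not answered in the thread. The option's documentation describes a value of 1 day as backfilling once a day and 1 week as backfilling once a week, which points to the second reading. Under that assumption, a toy scheduler in plain Python shows what "hourly job, 1 day backfill" would look like. This is a sketch of one interpretation, not Auto Loader's actual implementation; the function name is hypothetical, and treating the very first trigger as a backfill is an assumption of the model.

```python
from datetime import datetime, timedelta


def backfill_triggers(trigger_times, backfill_interval):
    """Return the triggers that would also run a backfill.

    Toy model: a trigger runs a backfill only if at least
    `backfill_interval` has passed since the previous backfill.
    """
    last_backfill = None
    flagged = []
    for t in trigger_times:
        if last_backfill is None or t - last_backfill >= backfill_interval:
            flagged.append(t)
            last_backfill = t
    return flagged


# An hourly job observed over two days (49 triggers, hour 0 through hour 48).
start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(49)]

due = backfill_triggers(hourly, timedelta(days=1))
print([t.isoformat() for t in due])
```

In this model only the triggers at hour 0, hour 24, and hour 48 include a backfill; every other hourly run just does its normal file discovery.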

 
