Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What does autoloader's cloudfiles.backfillInterval do?

FabriceDeseyn
Contributor

I'm using Auto Loader in directory listing mode (without incremental file listing), and sometimes new files are neither picked up nor found in the cloud_files-listing.

I have found that setting the 'cloudFiles.backfillInterval' option resolves the detection of these files, and therefore it seems to me that this is an effect of the lack of a 100% guarantee in the file notification system.

Now I am wondering what the 'cloudFiles.backfillInterval' option actually does, as I find the documentation ambiguous.

Will 'cloudFiles.backfillInterval':

  • Recompute the cloud_files-listing every interval and only process the new files
  • or recompute the cloud_files-listing every interval and process all files?

PS: When looking at the cloud_files-listing I do not get any discovery_times, I suppose these are only relevant in file notification mode?


1 ACCEPTED SOLUTION


Lakshay
Esteemed Contributor

Hi @Fabrice Deseyn, the cloudFiles.backfillInterval option is there to make sure that all files eventually get processed. The backfill is not what handles new files: those are processed according to your configuration, in either directory listing or file notification mode. Since there is no 100% guarantee that every file gets processed that way, the backfill runs asynchronously to pick up any older files that have not been processed yet. With cloudFiles.backfillInterval, you control the interval at which that backfill runs.

I would also suggest using either file notification mode or incremental listing for better performance.


5 REPLIES

Hi @Lakshay Goel​ ,

So to make sure I correctly understood your answer (see snippet below):

Since there is no 100% guarantee that all files will be processed, the backfill process runs asynchronously to pick up any old files that have not been processed. Using backFillinterval, you can control how the old files will be processed.

So only the old files that have not yet been processed will be processed by the backfill?

Lakshay
Esteemed Contributor

Yes, that is correct
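
To make the confirmed behavior concrete, here is a toy model in plain Python. It is only an illustration of the semantics described above (a backfill looks at the full listing but only hands over files that were not processed before); it is not how Auto Loader is implemented internally, and the function and file names are made up for the example.

```python
def files_for_backfill(listing, processed):
    """Return the files from the full listing that are not yet marked as processed.

    Hypothetical helper: models the "only unprocessed files" semantics,
    not Auto Loader's actual internals.
    """
    return sorted(set(listing) - set(processed))


# State before the backfill: two files were ingested, c.json was missed.
processed = {"a.json", "b.json"}
listing = ["a.json", "b.json", "c.json"]

# The backfill sees the full listing but only returns the missed file.
missed = files_for_backfill(listing, processed)
processed.update(missed)

# A second backfill over the same listing finds nothing left to do.
second_pass = files_for_backfill(listing, processed)
```

In this model, already processed files are never handed over again, which matches the answer above.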

Kiranrathod
New Contributor III

Hi @Lakshay Goel​ ,

Where can I set the backfillInterval property in the code? Do you have any sample code for this use case?

g96g
New Contributor III

You set it as an option on the reader when you read the files: .option("cloudFiles.backfillInterval", "1 week")
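
For context, a minimal sketch of where that option sits in an Auto Loader read. The helper names, the JSON format, and the paths are assumptions for illustration only; the `spark` session is expected to come from a Databricks notebook or job, so the read itself is wrapped in a function and not executed here.

```python
def autoloader_options(file_format, schema_location, backfill_interval="1 week"):
    """Collect Auto Loader options, including the backfill interval.

    Hypothetical helper; the option keys are the documented cloudFiles.* names.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Interval at which the asynchronous backfill is triggered.
        "cloudFiles.backfillInterval": backfill_interval,
    }


def read_with_backfill(spark, source_path, options):
    """Build the streaming DataFrame; `spark` is an existing SparkSession on Databricks."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Placeholder values, replace with your own format and locations.
opts = autoloader_options("json", "/tmp/example/_schema")
```

On Databricks you would then call something like `df = read_with_backfill(spark, "/mnt/landing/events", opts)` and continue with `df.writeStream...` as usual. Passing the options one by one with `.option(...)`, as in the reply above, is equivalent.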

 

 
 