Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Auto Loader cloudFiles.backfillInterval

therealchainman
New Contributor II

Hello, 

I have been reading the Databricks Auto Loader documentation about the cloudFiles.backfillInterval configuration, and I still have a question about a specific detail of how it works. I was only able to find examples of it being set to 1 day or 1 week, so I'm assuming you can enter any interval, such as x hours, x days, x weeks, or x months. My question is how it uses that 1 week to backfill.

Does it look at the lastModified time of files arriving in the input directory that have not been processed and calculate currentTime - lastModified <= backfillInterval?

Or does it run the backfill once a week, so that if I ran the Auto Loader pipeline last week, it will perform a backfill? In that case, would the backfill just look through all the files in the input directory and the cloud_files_state and make sure all of them have been processed?

I'm not getting a clear picture of what exactly backfillInterval does, but it seems useful, since it guarantees that 100% of files are processed.

 

1 ACCEPTED SOLUTION

saipujari_spark
Databricks Employee

Hey @therealchainman 

The last backfill time (lastBackfillFinishTimeMs) is recorded as part of the checkpoint's offset files. This lets Auto Loader know when the last backfill was triggered and when to trigger the next periodic backfill.

Hope this answers your question.

Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks


4 REPLIES

Lakshay
Databricks Employee

@therealchainman 

The backfillInterval option is provided to make sure that all files are eventually ingested. When you create a new stream, some files might be missed and not ingested. Backfill is an asynchronous process, triggered at the interval defined by the backfillInterval option, that checks for any files that have been missed and ingests them.
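Combined with the accepted answer above, the periodic trigger can be sketched in plain Python. This is a simplified mental model only, not Auto Loader's actual implementation; the function name is hypothetical, but the inputs mirror the lastBackfillFinishTimeMs value recorded in the checkpoint:

```python
# Simplified model of a periodic backfill trigger (illustration only;
# not Auto Loader's real implementation).

def should_trigger_backfill(last_backfill_finish_ms: int,
                            now_ms: int,
                            backfill_interval_ms: int) -> bool:
    """Backfill fires once the configured interval has elapsed since the
    last recorded backfill finish time."""
    return now_ms - last_backfill_finish_ms >= backfill_interval_ms

WEEK_MS = 7 * 24 * 60 * 60 * 1000

# Last backfill finished 3 days ago with a "1 week" interval: no backfill yet.
print(should_trigger_backfill(0, 3 * 24 * 60 * 60 * 1000, WEEK_MS))  # False
# Last backfill finished 8 days ago: time to backfill again.
print(should_trigger_backfill(0, 8 * 24 * 60 * 60 * 1000, WEEK_MS))  # True
```

So the interval is measured against the recorded finish time of the previous backfill, not against each file's lastModified timestamp.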


How do we use cloudFiles.backfillInterval in our code, and which property do we need to set?

Lakshay
Databricks Employee
Databricks Employee

Hi @Kiranrathod, you can use the property "cloudFiles.backfillInterval" to enable the backfill. Please refer to the doc: https://docs.databricks.com/en/ingestion/auto-loader/options.html#:~:text=cloudFiles.backfillInterva...
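For reference, a minimal sketch of setting this option on an Auto Loader stream. The paths, table name, and file format below are placeholder assumptions, and this only runs on a Databricks runtime:

```python
# Minimal Auto Loader stream with a weekly asynchronous backfill
# (Databricks runtime only; paths and table name are placeholders).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      # Periodically backfill so any files missed by the stream
      # are eventually ingested.
      .option("cloudFiles.backfillInterval", "1 week")
      .load("/mnt/raw/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .toTable("bronze.events"))
```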
