07-18-2023 10:33 PM
Hello,
I have been reading the Databricks Auto Loader documentation on the cloudFiles.backfillInterval configuration, and I still have a question about a specific detail of how it works. I was only able to find examples of it being set to 1 day or 1 week, so I'm assuming you can enter any interval there, such as x hours, x days, x weeks, or x months. My question is: how does it use that 1 week to backfill?
Does it look at the lastModified time of the files arriving in the input directory that have not been processed and check whether currentTime - lastModified <= backfillInterval?
Or does it run the backfill once a week, so that if I last ran the Databricks Auto Loader pipeline a week ago, it will perform a backfill on the next run? In that case, would the backfill just look through all the files in the input directory and the cloud_file_state and make sure all of them have been processed?
I'm not getting a good picture of what exactly backfillInterval does. It seems useful, though: the documentation says it guarantees that 100% of files get processed.
07-19-2023 12:14 PM
Hey @therealchainman
The time of the last backfill (lastBackfillFinishTimeMs) is recorded in the checkpoint's offset files. This lets Auto Loader know when the last backfill ran and when to trigger the next periodic backfill.
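To make that concrete, here is a minimal Python sketch of how a recorded finish time could drive the next periodic backfill. Only the field name lastBackfillFinishTimeMs comes from this thread. The offset file layout in SAMPLE_OFFSET, the helper names, and the "elapsed time >= interval" rule are assumptions for illustration, not the confirmed Auto Loader implementation.

```python
import json

# Hypothetical offset file contents: a version line followed by JSON lines.
# The real checkpoint layout may differ; only the field name
# lastBackfillFinishTimeMs is taken from the thread.
SAMPLE_OFFSET = "\n".join([
    "v1",
    '{"batchTimestampMs": 1689790440000}',
    '{"seqNum": 12, "lastBackfillFinishTimeMs": 1689700005000}',
])


def last_backfill_finish_ms(offset_file_text):
    """Return lastBackfillFinishTimeMs from offset file text, or None if absent."""
    for line in offset_file_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the version line and anything that is not a JSON object
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "lastBackfillFinishTimeMs" in record:
            return int(record["lastBackfillFinishTimeMs"])
    return None


def backfill_due(last_finish_ms, interval_ms, now_ms):
    """Assumed rule: a backfill is due once a full interval has elapsed."""
    if last_finish_ms is None:
        return True  # no backfill recorded yet
    return now_ms - last_finish_ms >= interval_ms


ONE_DAY_MS = 24 * 60 * 60 * 1000
finished = last_backfill_finish_ms(SAMPLE_OFFSET)
print(backfill_due(finished, ONE_DAY_MS, finished + ONE_DAY_MS))
```

Under this assumed rule, the decision depends only on when the previous backfill finished, not on the lastModified time of individual files.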
Hope this answers your question.
07-19-2023 12:01 PM
The cloudFiles.backfillInterval option is provided to make sure that all files are eventually ingested. When you create a new stream, some files might be missed and not ingested. Backfill is an asynchronous process that is triggered at the interval defined by the backfillInterval option. It checks for all the files that have been missed and ingests them.
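As a rough sketch of how the option could be set in PySpark: the helper names auto_loader_options and read_with_backfill, the paths, and the "1 week" value are illustrative assumptions. The option keys themselves (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.useNotifications, cloudFiles.backfillInterval) come from the Auto Loader options documentation.

```python
def auto_loader_options(file_format, backfill_interval, schema_location):
    """Build the option map for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema.
        "cloudFiles.schemaLocation": schema_location,
        # File notification mode, where delivery of every event is not guaranteed.
        "cloudFiles.useNotifications": "true",
        # Interval string such as "1 day" or "1 week".
        "cloudFiles.backfillInterval": backfill_interval,
    }


def read_with_backfill(spark, input_path, options):
    """Create the streaming DataFrame; `spark` is an existing SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(input_path)
    )


# Hypothetical paths, for illustration only.
options = auto_loader_options("json", "1 week", "/tmp/example/_schema")
print(options["cloudFiles.backfillInterval"])
```

On a Databricks cluster you would then call read_with_backfill(spark, "<input path>", options) and write the result out with a checkpoint location, since the checkpoint is where the backfill state is kept.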
10-20-2023 06:15 AM
01-11-2024 08:57 AM
Hi @Kiranrathod, you can use the property "cloudFiles.backfillInterval" to use the backfill. Please refer to the doc: https://docs.databricks.com/en/ingestion/auto-loader/options.html#:~:text=cloudFiles.backfillInterva...