07-18-2023 10:33 PM
Hello,
I have been reading the Databricks Auto Loader documentation about the cloudFiles.backfillInterval configuration, and I still have a question about a specific detail of how it works. I was only able to find examples of it being set to 1 day or 1 week, so I'm assuming you can enter any interval such as x hours, x days, x weeks, or x months. My question is how it actually uses that 1 week to backfill.
Does it look at the lastModified time on the files arriving in the input directory that have not been processed and calculate currentTime - lastModified <= backfillInterval?
Or does it run the backfill once a week, so that if I ran the Auto Loader pipeline last week, it will perform a backfill? In that case, would the backfill just look through all the files in the input directory and the cloud_files_state and make sure all of them have been processed?
I'm not getting a clear picture of what exactly backfillInterval does, but it seems useful: the documentation says it guarantees 100% of files are processed.
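For reference, a minimal Auto Loader stream that sets this option might look like the following (the paths, schema location, and file format here are made-up placeholders, not from my actual pipeline):

```python
# Hypothetical Auto Loader stream with a weekly backfill interval.
# All paths and the table name are illustrative placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    # Ask Auto Loader to run an asynchronous backfill once per week
    # to catch any files that were missed.
    .option("cloudFiles.backfillInterval", "1 week")
    .load("/mnt/landing/events")
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .toTable("bronze.events")
)
```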
Labels: Spark
Accepted Solutions
07-19-2023 12:14 PM
Hey @therealchainman
The time of the last backfill (lastBackfillFinishTimeMs) is recorded as part of the checkpoint's offset files. This lets Auto Loader know when the last backfill was triggered and when to trigger the next periodic one.
Hope this answers your question.
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks
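As I read that answer, the decision to kick off the next backfill is just a timestamp comparison against the configured interval, along these lines (a sketch, not actual Auto Loader code; the names are modeled on the lastBackfillFinishTimeMs field mentioned above):

```python
# Sketch of the periodic-backfill trigger described above.
# Not Auto Loader internals; function and variable names are illustrative.

def should_trigger_backfill(now_ms: int, last_backfill_finish_ms: int,
                            backfill_interval_ms: int) -> bool:
    """Trigger a new backfill once the configured interval has elapsed
    since the last backfill finished (as recorded in the checkpoint's
    offset files)."""
    return now_ms - last_backfill_finish_ms >= backfill_interval_ms

WEEK_MS = 7 * 24 * 60 * 60 * 1000

# Last backfill finished 8 days ago -> a new one is due.
print(should_trigger_backfill(9 * 24 * 3600 * 1000,
                              1 * 24 * 3600 * 1000,
                              WEEK_MS))  # -> True
```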
07-19-2023 12:01 PM
The backfillInterval option is provided to make sure that eventually all files are ingested. When you create a new stream, some files might be missed and not ingested. Backfill is an asynchronous process that is triggered at the interval defined by the backfillInterval option; it checks for all the files that have been missed and ingests them.
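Conceptually, that asynchronous pass amounts to a set difference between a full listing of the input directory and the files the stream has already recorded as processed. A sketch under that assumption (names are illustrative, not Auto Loader internals):

```python
# Illustrative sketch of the backfill's "find missed files" step.
# Not Auto Loader internals; names are made up.

def find_missed_files(listed_files: set[str], processed_files: set[str]) -> set[str]:
    """Files present in the input directory listing but absent from the
    stream's processed-file state; the backfill ingests these."""
    return listed_files - processed_files

listed = {"a.json", "b.json", "c.json"}
processed = {"a.json", "c.json"}
print(sorted(find_missed_files(listed, processed)))  # -> ['b.json']
```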
01-11-2024 08:57 AM
Hi @Kiranrathod, you can use the property "cloudFiles.backfillInterval" to run the backfill. Please refer to the doc: https://docs.databricks.com/en/ingestion/auto-loader/options.html#:~:text=cloudFiles.backfillInterva...

