Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Questions on Auto Loader auto Listing Logic

JIWON
New Contributor III

Hi everyone,

I’m investigating some performance patterns in our Auto Loader (S3) pipelines and would like to clarify the internal listing logic.

Context: We run a batch job every hour using Auto Loader. After March 10th, we noticed that our execution time jumped from 1 minute to over 5 minutes. I've confirmed from the March 10 release notes that the default value of cloudFiles.useIncrementalListing was changed to false, which explains the sudden performance drop. Explicitly setting it to true resolved the issue.

The Mystery (Periodic Spikes): However, looking at the data before March 10th (when auto or true was the default), I noticed a consistent pattern: execution times increased significantly every 8 hours at UTC 04:00, 12:00, and 20:00.

My Questions:

  1. Does cloudFiles.fullDirectoryScanInterval actually exist? I’ve heard this option controls the interval for full scans when cloudFiles.useIncrementalListing is set to "auto". Is this a valid/supported configuration?

  2. Is the default interval 8 hours? The UTC 04, 12, 20 pattern is too consistent to be a coincidence. I'd like to know if Auto Loader is hard-coded (or defaulted) to perform a "Full Listing" every 8 hours even when in incremental mode.

  3. Internal Logic of "auto": How exactly does Auto Loader decide when to perform a full vs. incremental scan when set to "auto"? Is it purely time-based, or does it depend on other factors?

P.S. I am aware that Databricks recommends File Events for production, but due to cost and the lack of real-time requirements, we prefer the 1-hour batch interval approach.

Looking forward to your insights!

1 REPLY

aleksandra_ch
Databricks Employee

Hi @JIWON ,

1. No, cloudFiles.fullDirectoryScanInterval is not a supported option.

2. Assuming the job is triggered every hour, the spikes every 8 hours can be explained by the following behavior:

To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.
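The arithmetic behind the 8-hour pattern can be sketched as a toy model. This is not Auto Loader's actual implementation: it only assumes that a counter of consecutive incremental lists resets after each full list, and that each hourly run performs exactly one listing. The function name `listing_schedule` is made up for illustration.

```python
def listing_schedule(num_runs, incremental_before_full=7):
    """Toy model: label each run 'incremental' or 'full'.

    A full listing happens once `incremental_before_full` consecutive
    incremental listings have completed; the counter then resets.
    """
    schedule = []
    consecutive_incremental = 0
    for _ in range(num_runs):
        if consecutive_incremental == incremental_before_full:
            schedule.append("full")
            consecutive_incremental = 0
        else:
            schedule.append("incremental")
            consecutive_incremental += 1
    return schedule


# 24 hourly runs, i.e. one day of the job described in the question.
schedule = listing_schedule(24)
full_runs = [i for i, kind in enumerate(schedule) if kind == "full"]
print(full_runs)  # [7, 15, 23]
```

With hourly triggers, the full listings land 8 runs apart, which matches three spikes per day at a fixed 8-hour spacing. Which clock hours they fall on depends on when the counter last reset.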

3. So, if you want to decrease or increase the full scan frequency, you can set an interval with the cloudFiles.backfillInterval option, for example .option("cloudFiles.backfillInterval", "1 week"). Bear in mind that the full listing is what picks up any files the incremental listings missed, so running it less often means missed files may go undetected for longer.
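For reference, here is a minimal sketch of how these options might be wired into a stream reader. The option keys are the ones discussed in this thread plus cloudFiles.format and cloudFiles.schemaLocation; the helper names, the "json" format, and the default values are assumptions for illustration, not a recommended configuration.

```python
def auto_loader_options(schema_location,
                        backfill_interval="1 week",
                        incremental_listing="auto",
                        file_format="json"):
    """Collect Auto Loader options in one dict (values are illustrative)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useIncrementalListing": incremental_listing,
        "cloudFiles.backfillInterval": backfill_interval,
    }


def build_reader(spark, source_path, schema_location, **overrides):
    """Return a streaming DataFrame that reads source_path via Auto Loader.

    `spark` is an existing SparkSession, such as the one a Databricks
    notebook or job provides.
    """
    options = auto_loader_options(schema_location, **overrides)
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


# Inspect the options without needing a Spark session.
opts = auto_loader_options("/tmp/hypothetical/schema")
print(opts["cloudFiles.backfillInterval"])  # 1 week
```

In a Databricks job, build_reader would be called with the job's own spark session, the S3 source path, and a schema location of your choosing; the paths shown here are placeholders.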


Hope it helps.


P.S. I'm curious which of your requirements are not compatible with the File events mode. You would still be able to run the job every hour (rather than in real time) with File events mode.

Best regards,