
Questions on Auto Loader "auto" Listing Logic

JIWON
New Contributor III

Hi everyone,

I’m investigating some performance patterns in our Auto Loader (S3) pipelines and would like to clarify the internal listing logic.

Context: We run a batch job every hour using Auto Loader. Since March 10th, our execution time has jumped from about 1 minute to over 5 minutes. The March 10 release notes say that the default value of cloudFiles.useIncrementalListing was changed to false, which explains the sudden slowdown. Explicitly setting it to true resolved the issue.
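For reference, this is roughly how the option is set on our side. It is a minimal sketch, not our exact job: the helper names `auto_loader_options` and `build_stream`, the JSON format, and the schema location are placeholders I made up for this post, and `build_stream` needs a Databricks runtime with a `spark` session to actually run.

```python
def auto_loader_options(schema_location: str, file_format: str = "json") -> dict:
    """Build the cloudFiles options passed to Auto Loader (placeholder values)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Pin incremental listing explicitly instead of relying on the default.
        "cloudFiles.useIncrementalListing": "true",
    }


def build_stream(spark, source_path: str, schema_location: str):
    """Return an Auto Loader streaming DataFrame for the given S3 path."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location))
        .load(source_path)
    )
```

Keeping the options in a plain dict makes it easy to diff what we pass against whatever the runtime default happens to be.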

The Mystery (Periodic Spikes): However, looking at the data from before March 10th (when auto or true was the default), I noticed a consistent pattern: execution times increased significantly every 8 hours, at 04:00, 12:00, and 20:00 UTC.
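In case anyone wants to check for the same pattern in their own job history, here is a small sketch of how the spike hours can be picked out. The function name `spike_hours`, the 2x threshold, and the sample data are all assumptions for illustration; the sample is synthetic (60 s normal runs, 300 s at the three spike hours, arbitrary start date), not our real measurements.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median


def spike_hours(runs, factor=2.0):
    """Return sorted UTC hours whose median duration exceeds factor x the overall median.

    runs: iterable of (start_time, duration_seconds) with timezone-aware start times.
    """
    runs = list(runs)
    overall = median(duration for _, duration in runs)
    by_hour = defaultdict(list)
    for start, duration in runs:
        by_hour[start.astimezone(timezone.utc).hour].append(duration)
    return sorted(h for h, ds in by_hour.items() if median(ds) > factor * overall)


# Synthetic week of hourly runs shaped like the pattern described above.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary date
sample = []
for i in range(24 * 7):
    ts = start + timedelta(hours=i)
    sample.append((ts, 300.0 if ts.hour in (4, 12, 20) else 60.0))

print(spike_hours(sample))  # prints [4, 12, 20]
```

Grouping by hour of day rather than eyeballing a chart is what convinced me the 8-hour spacing is not noise.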

My Questions:

  1. Does cloudFiles.fullDirectoryScanInterval actually exist? I’ve heard this option controls the interval for full scans when using useIncrementalListing = "auto". Is this a valid/supported configuration?

  2. Is the default interval 8 hours? The UTC 04, 12, 20 pattern is too consistent to be a coincidence. I'd like to know if Auto Loader is hard-coded (or defaulted) to perform a "Full Listing" every 8 hours even when in incremental mode.

  3. Internal Logic of "auto": How exactly does Auto Loader decide when to perform a full vs. incremental scan when set to "auto"? Is it purely time-based, or does it depend on other factors?

P.S. I am aware that Databricks recommends File Events for production, but due to cost and the lack of real-time requirements, we prefer the 1-hour batch interval approach.

Looking forward to your insights!
