Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Questions on Auto Loader "auto" Listing Logic

JIWON
New Contributor III

Hi everyone,

I’m investigating some performance patterns in our Auto Loader (S3) pipelines and would like to clarify the internal listing logic.

Context: We run a batch job every hour using Auto Loader. Recently, after March 10th, we noticed our execution time jumped from 1 minute to over 5 minutes. I've confirmed from the March 10 release notes that the default value for useIncrementalListing was changed to false, which explains the sudden performance drop. Explicitly setting it to true resolved this issue.
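For reference, this is roughly how the reader is configured now (the bucket paths, file format, and table name below are placeholders, not our real ones):

```python
# Hourly batch via Auto Loader, run as a scheduled job.
# The listing mode is now pinned explicitly so a future change to the
# default cannot silently alter the behavior again.
df = (
    spark.readStream  # `spark` is the ambient SparkSession in Databricks
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useIncrementalListing", "true")  # previously relied on the default
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/events/")
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)  # drain whatever is available, then stop
    .toTable("events_bronze")
)
```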

The Mystery (Periodic Spikes): However, looking at the data before March 10th (when auto or true was the default), I noticed a consistent pattern: execution times increased significantly every 8 hours at UTC 04:00, 12:00, and 20:00.

My Questions:

  1. Does cloudFiles.fullDirectoryScanInterval actually exist? I’ve heard this option controls the interval for full scans when using useIncrementalListing = "auto". Is this a valid/supported configuration?

  2. Is the default interval 8 hours? The UTC 04, 12, 20 pattern is too consistent to be a coincidence. I'd like to know if Auto Loader is hard-coded (or defaulted) to perform a "Full Listing" every 8 hours even when in incremental mode.

  3. Internal Logic of "auto": How exactly does Auto Loader decide when to perform a full vs. incremental scan when set to "auto"? Is it purely time-based, or does it depend on other factors?

P.S. I am aware that Databricks recommends File Events for production, but due to cost and the lack of real-time requirements, we prefer the 1-hour batch interval approach.

Looking forward to your insights!

ACCEPTED SOLUTION

aleksandra_ch
Databricks Employee

Hi @JIWON ,

1. No, cloudFiles.fullDirectoryScanInterval is not a valid or supported option; it does not exist.

2. Assuming the job is triggered every hour, the spikes every 8 hours are explained by this behavior (from the documentation):

To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.

3. So, if you want to reduce or increase the full-scan frequency, you can set an interval with the cloudFiles.backfillInterval option, for example .option("cloudFiles.backfillInterval", "1 week"). Just bear in mind that the full listing is what picks up any files the incremental listing missed, so running it less often means missed files may go unprocessed for longer.
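Putting the two options together, a sketch of the reader configuration might look like this (the paths and file format are placeholders; adjust them to your pipeline):

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Let Auto Loader choose between incremental and full listing:
    .option("cloudFiles.useIncrementalListing", "auto")
    # Stretch the completeness backfill to a weekly cadence; a longer
    # interval means any missed files stay undetected for longer.
    .option("cloudFiles.backfillInterval", "1 week")
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/events")
    .load("s3://bucket/events/")
)
```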


Hope it helps.


P.S. I'm really curious which of your real-time requirements are incompatible with File Events mode. You would still be able to run the job every hour (rather than in real time) with File Events enabled.

Best regards,



JIWON
New Contributor III

Hi aleksandra_ch,

Thank you so much for the detailed explanation! I feel a bit embarrassed realizing I hadn't thoroughly checked the documentation before asking.

As you pointed out, since my Auto Loader runs as an hourly batch, the "7 incremental + 1 full listing" logic perfectly explains why I was seeing performance spikes every 8 hours. After discovering that the default for useIncrementalListing was changed to false in the March 10 release, I explicitly set it to true, and the issue has been resolved.
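Just to convince myself of the arithmetic, here is a toy simulation (my own sketch, not Auto Loader internals) of an hourly job where every 8th run performs a full listing. The 21:00 UTC start is a hypothetical phase chosen so the full listings land on the hours we observed; in practice the phase depends on when the cycle actually began.

```python
def listing_mode(run_index):
    """Toy model: 7 incremental lists, then 1 full list, repeating."""
    return "full" if run_index % 8 == 7 else "incremental"

# Hourly runs over two days, with the cycle hypothetically starting at 21:00 UTC:
full_hours = sorted({(21 + i) % 24 for i in range(48) if listing_mode(i) == "full"})
print(full_hours)  # -> [4, 12, 20], matching the observed spike times
```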

I am aware that using incremental listing alone carries a risk of missing files. However, given that our S3 data is Hive-partitioned (year/month/day/hour) and the filenames themselves include timestamps, the risk seems low—though I agree it's not 100% foolproof.

Also, your P.S. was a real eye-opener! I had always associated "File Events" mode exclusively with real-time streaming, so I hadn't even explored using it for our hourly batches. I'll definitely look into implementing that to see if it provides better stability for our pipeline.

Thank you again for your help and for sharing such great insights.

Best regards, Jiwon