01-28-2026 07:30 AM
Hello, I am using Databricks Auto Loader with managed file events turned on and includeExistingFiles enabled.
I want to understand if there is a way of increasing the speed of Auto Loader's initial file listing.
I thought the idea behind managed file events mode was that Databricks essentially stores a list of all the file events that have come in, so that when you create an Auto Loader instance reading from a path under a container with file events enabled, Auto Loader builds a dedicated queue for that instance and replays the events.
The issue I have is that when running large tests, the includeExistingFiles listing is very slow and has to be redone if, for example, I create another Auto Loader instance to write to another table. Is there a way of passing in the list of existing files, or of increasing the number of processes doing the listing, e.g. adding another process at each folder level to walk that directory?
01-28-2026 08:15 AM
Greetings @ChrisLawford_n1 , I did some research and would like to share what I found to help you.
Short answer: Even with managed file events, Auto Loader has to do a full directory listing the very first time a new stream starts. That initial scan is how it establishes a safe read position in the file-events cache and the stream checkpoint. This happens even if includeExistingFiles=false, can’t be bypassed, and can’t be replaced with a user-provided file list. Each stream establishes its own position, so you pay that first-run cost per stream.
Why it behaves this way
Managed file events work by caching metadata for recent file creates/updates and then incrementally discovering new files from that cache on subsequent runs. But on the first run, Auto Loader must do a full listing to “get current” and persist a starting position in the checkpoint.
Even with includeExistingFiles=false, that initial directory listing still happens so Auto Loader can identify files created after the stream start and anchor its read position. There’s no supported way to skip this step.
Also worth calling out: manual tuning knobs like cloudFiles.fetchParallelism are ignored with managed file events. Databricks automatically tunes the listing and discovery behavior, so there isn’t a way to increase the number of parallel folder-listing workers.
Practical ways to reduce the time or impact
The biggest win is ingest once, then fan out. Stand up a single Auto Loader stream into a raw or bronze table, and build downstream silver/gold tables from that data instead of starting multiple ingestion streams. That avoids paying the initial discovery cost over and over.
Use Unity Catalog volumes and narrow paths. Point Auto Loader at a volume subdirectory (for example, /Volumes/catalog/schema/volume/subdir) instead of a broad cloud path. Volumes use a more optimized listing pattern and tend to perform much better during the first scan.
Keep the stream warm. Run each managed-file-events stream at least once every seven days. The cache only retains recent metadata; if a stream sits idle longer than that, it will fall back to a full listing again.
For one-off backfills, use Trigger.AvailableNow and tune cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger to control batch size and throughput without leaving a long-running stream around.
If the new stream truly doesn’t need historical files, add a time filter like modifiedAfter. That constrains the initial listing to newer files and significantly reduces first-run work.
If you absolutely must use directory listing mode instead of file events (generally not recommended), make sure files are laid out lexicographically (YYYY/MM/DD/HH or monotonic prefixes). That helps Auto Loader prune listings, but avoid the deprecated incremental listing option and prefer file events whenever possible.
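Putting a few of these tips together, a first-run stream for a new table might look like the sketch below. This is illustrative only: the paths, table names, and cutoff timestamp are placeholder values, and the exact option names should be checked against the Auto Loader options documentation for your runtime.

```python
# Sketch: new Auto Loader stream that constrains the initial listing with
# modifiedAfter and runs as a one-shot catch-up with Trigger.AvailableNow.
# All paths, table names, and the timestamp below are placeholders.
options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useManagedFileEvents": "true",
    # Limit first-run discovery to files modified after this cutoff.
    "cloudFiles.modifiedAfter": "2026-01-01 00:00:00.000000 UTC",
    # Cap per-batch work during the catch-up phase.
    "cloudFiles.maxBytesPerTrigger": "10g",
}

# On Databricks (spark is the session provided by the runtime):
# (spark.readStream.format("cloudFiles")
#      .options(**options)
#      .load("/Volumes/catalog/schema/volume/subdir")
#      .writeStream
#      .option("checkpointLocation", "/checkpoints/bronze_table")
#      .trigger(availableNow=True)
#      .toTable("catalog.schema.bronze_table"))
```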
Clarifying a couple of assumptions
Managed file events do keep a cache of recent file metadata, but it’s not a permanent catalog of everything that has ever existed. Each new stream still has to scan the directory once to synchronize and checkpoint its own read position.
There’s also no supported way to “hand” Auto Loader a precomputed file list to skip that initial listing. Exactly-once guarantees come from Auto Loader owning its own state via the checkpoint and cache.
Hoping this helps, Louis.
02-05-2026 07:47 AM
Hey Louis,
Yeah, your explanation of each Auto Loader instance being responsible for its own position is what hit home. We are just unfortunate that our Auto Loader instances use glob patterns that overlap a lot, so to the outside user the indexing looks the same for each instance. This makes the initial run incredibly slow; I was just trying to find a way of speeding it up.
2 weeks ago
Hi @ChrisLawford_n1,
You are correct that managed file events (cloudFiles.useManagedFileEvents = true) works by having Databricks maintain a record of file events on the external location, so when you start a new Auto Loader stream, it can replay those events rather than doing a full directory listing. However, as you have observed, when cloudFiles.includeExistingFiles is set to true (which is the default), Auto Loader still needs to perform an initial directory listing on the very first run to discover all files that existed before the file events service started tracking. This initial listing is the bottleneck you are experiencing.
Here are several approaches to improve performance in your scenario:
OPTION 1: DISABLE INCLUDE EXISTING FILES FOR SUBSEQUENT STREAMS
If you have already ingested all existing files through your first Auto Loader stream, additional streams reading from the same path do not necessarily need to re-list all existing files. You can set:
.option("cloudFiles.includeExistingFiles", "false")
This tells the new Auto Loader instance to only pick up files that arrive after the stream starts. Since managed file events are active on the external location, the new stream will immediately start receiving notifications for new files without any initial listing overhead.
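As a minimal sketch of Option 1 (table names, paths, and checkpoint locations below are placeholders, not values from this thread):

```python
# Sketch: a second Auto Loader stream that skips the historical backlog.
# Path, checkpoint, and table names are placeholder values.
options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useManagedFileEvents": "true",
    # Only pick up files that arrive after this stream first starts.
    "cloudFiles.includeExistingFiles": "false",
}

# On Databricks:
# (spark.readStream.format("cloudFiles")
#      .options(**options)
#      .load("s3://bucket/data/")
#      .writeStream
#      .option("checkpointLocation", "/checkpoints/second_table")
#      .toTable("catalog.schema.second_table"))
```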
If you do need historical files in the new target table, consider populating it with a one-time batch read (spark.read) from the source path or from an existing Delta table, and then switch to Auto Loader for incremental processing going forward.
OPTION 2: USE BACKFILL INTERVAL INSTEAD OF INCLUDE EXISTING FILES
Rather than having Auto Loader do a potentially slow initial listing, you can set cloudFiles.includeExistingFiles to false and instead use a periodic backfill to catch any files that may have been missed:
.option("cloudFiles.includeExistingFiles", "false")
.option("cloudFiles.backfillInterval", "1 day")
Note: cloudFiles.backfillInterval is not compatible with cloudFiles.useManagedFileEvents. So if you take this approach, you would use standard file notification mode instead of managed file events. The backfill runs asynchronously and does not block your stream processing.
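A sketch of what Option 2 could look like in standard file notification mode (cloudFiles.useNotifications instead of managed file events; all names below are placeholders, and the option combination should be verified against the Auto Loader documentation for your runtime):

```python
# Sketch: standard file notification mode with an asynchronous backfill.
# useNotifications replaces useManagedFileEvents here, since
# backfillInterval is not compatible with managed file events.
options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.includeExistingFiles": "false",
    # Periodically re-list the directory to catch any missed events.
    "cloudFiles.backfillInterval": "1 day",
}

# On Databricks:
# (spark.readStream.format("cloudFiles")
#      .options(**options)
#      .load("s3://bucket/data/")
#      .writeStream
#      .option("checkpointLocation", "/checkpoints/notify_table")
#      .toTable("catalog.schema.notify_table"))
```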
OPTION 3: INCREASE FETCH PARALLELISM
The cloudFiles.fetchParallelism option controls the number of threads used when fetching messages from the queueing service. The default is 1. Increasing this can help when there is a large volume of file events to replay:
.option("cloudFiles.fetchParallelism", "8")
Note that this option is documented as not applicable when cloudFiles.useManagedFileEvents is true, so this is more relevant for standard file notification mode. If you are specifically using managed file events, this may not help with the initial listing phase.
OPTION 4: PARTITION YOUR INPUT PATHS
Instead of pointing a single Auto Loader instance at a broad top-level path, consider splitting the workload across multiple streams, each pointing at a more specific sub-path. For example, if your data is organized by date or category:
# Stream 1: narrower path, its own checkpoint
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .load("s3://bucket/data/year=2025/") \
    .writeStream \
    .option("checkpointLocation", "/checkpoints/target_table_2025") \
    .toTable("catalog.schema.target_table")
# Stream 2: a sibling path, separate checkpoint, same target table
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .load("s3://bucket/data/year=2026/") \
    .writeStream \
    .option("checkpointLocation", "/checkpoints/target_table_2026") \
    .toTable("catalog.schema.target_table")
Each stream has a smaller directory tree to list, and they can run in parallel. Give each stream its own checkpoint location and you can write all of them to the same target Delta table. This effectively gives you the per-directory-level parallelism you are looking for.
OPTION 5: INITIAL BATCH LOAD PLUS INCREMENTAL STREAMING
For very large existing datasets, the most performant pattern is often to separate the initial load from the ongoing incremental ingestion:
1. Do a one-time batch load of all existing files:
df = spark.read.format("parquet").load("s3://bucket/data/")
df.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.target_table")
2. Then start Auto Loader for incremental processing with includeExistingFiles disabled:
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .option("cloudFiles.includeExistingFiles", "false") \
    .load("s3://bucket/data/") \
    .writeStream \
    .option("checkpointLocation", "/checkpoints/target_table") \
    .trigger(availableNow=True) \
    .toTable("catalog.schema.target_table")
This avoids the slow listing entirely for the streaming portion.
ADDITIONAL TIPS
- Make sure you are running Databricks Runtime 15.4 LTS or newer, as this version includes improvements that prevent waiting for full RocksDB state downloads before stream startup, which can improve overall stream initialization time.
- Run your Auto Loader streams at least once every 7 days when using file events. If more than 7 days pass between runs, the file events cache expires and Auto Loader falls back to a full directory listing.
- If you are using Trigger.AvailableNow, file discovery happens asynchronously with data processing, which can improve overall throughput during the initial catch-up phase.
For the full list of Auto Loader configuration options, see:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options.html
For production best practices:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/production.html
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.