Greetings @ChrisLawford_n1, I did some research and would like to share what I found in case it helps.
Short answer: Even with managed file events, Auto Loader has to do a full directory listing the very first time a new stream starts. That initial scan is how it establishes a safe read position in the file-events cache and the stream checkpoint. This happens even if includeExistingFiles=false, can’t be bypassed, and can’t be replaced with a user-provided file list. Each stream establishes its own position, so you pay that first-run cost per stream.
Why it behaves this way
Managed file events work by caching metadata for recent file creates/updates and then incrementally discovering new files from that cache on subsequent runs. But on the first run, Auto Loader must do a full listing to “get current” and persist a starting position in the checkpoint.
Even with includeExistingFiles=false, that initial directory listing still happens so Auto Loader can identify files created after the stream start and anchor its read position. There’s no supported way to skip this step.
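For illustration, here is a minimal PySpark sketch of a first run with managed file events. The option names follow the current Auto Loader docs; the paths, catalog, and table names are placeholders:

```python
# Assumes the Databricks-provided `spark` session; paths and table names
# below are placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useManagedFileEvents", "true")
    # Skips ingesting pre-existing files, but the first-run directory
    # listing still happens so Auto Loader can anchor its read position.
    .option("cloudFiles.includeExistingFiles", "false")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/volume/_schemas/events")
    .load("/Volumes/catalog/schema/volume/events")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/catalog/schema/volume/_checkpoints/events")
    .toTable("main.bronze.events")
)
```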
Also worth calling out: manual tuning knobs like cloudFiles.fetchParallelism are ignored with managed file events. Databricks automatically tunes the listing and discovery behavior, so there isn’t a way to increase the number of parallel folder-listing workers.
Practical ways to reduce the time or impact
The biggest win is ingest once, then fan out. Stand up a single Auto Loader stream into a raw or bronze table, and build downstream silver/gold tables from that data instead of starting multiple ingestion streams. That avoids paying the initial discovery cost over and over.
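A sketch of that pattern (again assuming the Databricks-provided `spark` session; the table and column names are made up):

```python
# Single Auto Loader stream into bronze -- pays the first-run discovery
# cost exactly once.
bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/volume/_schemas/bronze")
    .load("/Volumes/catalog/schema/volume/landing")
)
(
    bronze.writeStream
    .option("checkpointLocation", "/Volumes/catalog/schema/volume/_checkpoints/bronze")
    .toTable("main.bronze.raw_events")
)

# Downstream streams read the bronze Delta table, not the cloud path,
# so they never trigger Auto Loader's initial directory listing.
# (`event_type` is a hypothetical column for the example.)
silver = spark.readStream.table("main.bronze.raw_events").where("event_type = 'click'")
(
    silver.writeStream
    .option("checkpointLocation", "/Volumes/catalog/schema/volume/_checkpoints/silver_clicks")
    .toTable("main.silver.click_events")
)
```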
Use Unity Catalog volumes and narrow paths. Point Auto Loader at a volume subdirectory (for example, /Volumes/catalog/schema/volume/subdir) instead of a broad cloud path. Volumes use a more optimized listing pattern and tend to perform much better during the first scan.
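For example (placeholder paths):

```python
# Point at the narrow subdirectory you actually need, not the volume root.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/volume/_schemas/orders")
    .load("/Volumes/catalog/schema/volume/orders/2025")  # not /Volumes/catalog/schema/volume
)
```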
Keep the stream warm. Run each managed-file-events stream at least once every seven days. The cache only retains recent metadata; if a stream sits idle longer than that, it will fall back to a full listing again.
For one-off backfills, use Trigger.AvailableNow and tune cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger to control batch size and throughput without leaving a long-running stream around.
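Something like this, with the same placeholder paths; the exact limits depend on your file sizes and workload:

```python
# One-off backfill: drain everything currently available in rate-limited
# batches, then stop -- no long-running stream left behind.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/volume/_schemas/backfill")
    .option("cloudFiles.maxBytesPerTrigger", "10g")    # cap bytes per micro-batch
    .option("cloudFiles.maxFilesPerTrigger", "10000")  # cap files per micro-batch
    .load("/Volumes/catalog/schema/volume/backfill")
)
(
    df.writeStream
    .trigger(availableNow=True)  # processes the backlog, then the query exits
    .option("checkpointLocation", "/Volumes/catalog/schema/volume/_checkpoints/backfill")
    .toTable("main.bronze.backfill")
)
```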
If the new stream truly doesn’t need historical files, add a time filter like modifiedAfter. That constrains the initial listing to newer files and significantly reduces first-run work.
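A quick sketch; the cutoff timestamp below is just an example value (Spark expects the YYYY-MM-DDTHH:mm:ss format):

```python
# Only files modified after the cutoff are considered, which shrinks
# the first-run listing when historical files are irrelevant.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog/schema/volume/_schemas/recent")
    .option("modifiedAfter", "2025-01-01T00:00:00")  # example cutoff
    .load("/Volumes/catalog/schema/volume/events")
)
```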
If you absolutely must use directory listing mode instead of file events (generally not recommended), make sure files are laid out in lexicographic order (for example, YYYY/MM/DD/HH or other monotonically increasing prefixes). That helps Auto Loader prune listings, but avoid the deprecated incremental listing option and prefer file events whenever possible.
Clarifying a couple of assumptions
Managed file events do keep a cache of recent file metadata, but it’s not a permanent catalog of everything that has ever existed. Each new stream still has to scan the directory once to synchronize and checkpoint its own read position.
There’s also no supported way to “hand” Auto Loader a precomputed file list to skip that initial listing. Exactly-once guarantees come from Auto Loader owning its own state via the checkpoint and cache.
Hope this helps, Louis.