topic Auto Loader vs Batch for Large File Loads in Data Engineering

Auto Loader vs Batch for Large File Loads

SahiSammu — Tue, 04 Nov 2025 12:24:05 GMT

Hi everyone,

I'm seeing a dramatic difference in processing times between batch and streaming (Auto Loader) approaches for reading about 250,000 files from S3 in Databricks. My goal is to read metadata from these files and register it as a table (eventually use autoloader backup option). Here’s the comparison:

Batch approach (2 minutes for 250k files):

df = (
spark.read.format("binaryFile")
.option("recursiveFileLookup", "true")
.load(source_s3_path_default)
.select("path", "modificationTime", "length")
)
df.write.saveAsTable("some_table")

Auto Loader streaming approach (2.5 hours for 250k files):

write_stream = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.load(source_s3_path_default)
.select("path", "modificationTime", "length")
.writeStream
.outputMode("overwrite")
.option("checkpointLocation", f"{some_checkpoint}")
.trigger(availableNow=True)
.table(f"{some_table}")
)
write_stream.awaitTermination()

Why does Auto Loader take so much longer?
Same file count and S3 path
Same basic selection of columns

The only difference is using .read.format() vs .readStream.format("cloudFiles")

Am I missing something fundamental about how Auto Loader is designed for large initial loads?
Is all this overhead expected, and should I always use batch for historical loads and reserve Auto Loader only for incremental/real-time workflows?

Thanks in advance for your insights!

Re: Auto Loader vs Batch for Large File Loads

K_Anudeep — Tue, 04 Nov 2025 15:15:03 GMT

Hello @SahiSammu ,

Reading the data directly, lists all objects under source_s3_path_default once and creates a logical DataFrame comprising ~250k files. and then write to the target Delta table, which creates a single commit.
Auto Loader ingests files in micro-batches. By default, the autoloader ingests ~1000 files in a batch, then 250k files ≈ , 250 micro-batches and each micro-batch involves:

1. Listing/discovering candidate files
2. Filtering ones already seen (from their state)
3. Planning and executing a Spark job and then committing a Delta transaction
So if each microbatch takes even 30 seconds to process, then the total time taken to process would be 30sec * 250 batches= ~2 hours

So to answer your question, if you want to use

One-off, large historical backfill
→ Prefer batch (or COPY INTO) for speed and simplicity.
Ongoing ingestion / new files / exactly-once semantics
→ Use Auto Loader and a tuned maxFilesPerTrigger

Re: Auto Loader vs Batch for Large File Loads

SahiSammu — Tue, 04 Nov 2025 18:04:41 GMT

Thank you, Anudeep.

I plan to tune Auto Loader by increasing the maxFilesPerTrigger parameter to optimize performance. My decision to use Auto Loader is primarily driven by its built-in backup functionality via cloudFiles.cleanSource.moveDestination, which eliminates the need to maintain custom code for file cleanup.

If there is a better option to back up files after ingestion, please feel free to suggest it.

Thank you,