Re: Auto Loader vs Batch for Large File Loads

K_Anudeep · 11-04-2025

Reading the data directly lists all objects under source_s3_path_default once and builds a logical DataFrame covering all ~250k files. Writing that DataFrame to the target Delta table then produces a single commit.
Auto Loader, by contrast, ingests files in micro-batches. By default it picks up about 1,000 files per micro-batch (the cloudFiles.maxFilesPerTrigger default), so 250k files ≈ 250 micro-batches, and each micro-batch involves:

1. Listing/discovering candidate files
2. Filtering out files that have already been ingested (tracked in the stream's checkpoint state)
3. Planning and executing a Spark job, then committing a Delta transaction
So if each micro-batch takes even 30 seconds to process, the total comes to 30 sec × 250 batches = 7,500 sec, or roughly 2 hours.
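
To make that estimate easy to redo with your own numbers, here is a minimal back-of-the-envelope sketch. The function name and its parameters are made up for illustration, and the 30-second figure is only the assumed per-batch cost from above, not a measured value.

```python
import math


def estimate_backfill_seconds(total_files, files_per_batch=1000, seconds_per_batch=30.0):
    """Rough wall-clock estimate for draining a backlog in micro-batches.

    Hypothetical helper: it assumes every micro-batch costs the same fixed
    time, which is a simplification of real listing/planning/commit overhead.
    """
    batches = math.ceil(total_files / files_per_batch)
    return batches, batches * seconds_per_batch


batches, seconds = estimate_backfill_seconds(250_000)
print(f"{batches} micro-batches, about {seconds / 3600:.1f} hours")
```

Raising files_per_batch in this model shrinks the batch count, which is why tuning the per-trigger file limit matters for large backlogs.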

So, to answer your question, the right choice depends on the workload:

One-off, large historical backfill
→ Prefer batch (or COPY INTO) for speed and simplicity.
Ongoing ingestion / new files / exactly-once semantics
→ Use Auto Loader with a tuned cloudFiles.maxFilesPerTrigger
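
As a rough sketch of the two options in PySpark (not a drop-in implementation): the function names, the Parquet file format, the max_files value of 10,000, and the checkpoint and schema paths are all assumptions for illustration, so substitute your own. The functions take an existing `spark` session, as you would have in a Databricks notebook.

```python
def backfill_batch(spark, source_path, target_table, file_format="parquet"):
    """One-off backfill: a single listing, one Spark job, one Delta commit."""
    df = spark.read.format(file_format).load(source_path)
    df.write.format("delta").mode("append").saveAsTable(target_table)


def ingest_with_auto_loader(spark, source_path, target_table, checkpoint_path,
                            schema_path, file_format="parquet", max_files=10000):
    """Incremental ingestion with Auto Loader and a larger per-batch file cap.

    max_files is an assumed starting point, not a recommended value; tune it
    against your cluster size and file sizes.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", schema_path)
        .option("cloudFiles.maxFilesPerTrigger", max_files)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

With the cap raised from the default of 1,000, the same 250k files are spread over far fewer micro-batches, so the fixed per-batch overhead is paid fewer times.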

Anudeep