Hi everyone,
I'm seeing a dramatic difference in processing time between a batch read and an Auto Loader stream when reading about 250,000 files from S3 in Databricks. My goal is to read the file metadata and register it as a table (and eventually use the Auto Loader backup option). Here's the comparison:
Batch approach (2 minutes for 250k files):
# Batch: list every file with the binaryFile source and keep only the metadata columns
df = (
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
)
df.write.saveAsTable("some_table")
Auto Loader streaming approach (2.5 hours for 250k files):
# Auto Loader: same path and columns, discovered through cloudFiles
write_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")  # "overwrite" isn't a valid streaming output mode; append is what I intend
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
write_stream.awaitTermination()
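One thing I plan to try next is raising the per-micro-batch limits, since I believe Auto Loader defaults to roughly 1,000 files per batch, which would split 250k files into hundreds of tiny micro-batches. A minimal sketch, assuming cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger behave as documented (the values below are guesses, not tested):

tuned_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    # Assumption: raising these rate limits lets availableNow process far more
    # files per micro-batch than the default (reportedly 1,000 files)
    .option("cloudFiles.maxFilesPerTrigger", 100000)
    .option("cloudFiles.maxBytesPerTrigger", "50g")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
tuned_stream.awaitTermination()

If that helps, I'd take it as a sign the slowdown is mostly micro-batch overhead rather than the S3 listing itself.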
Why does Auto Loader take so much longer?
Same file count and S3 path
Same basic selection of columns
The only difference is spark.read.format("binaryFile") vs. spark.readStream.format("cloudFiles")
Am I missing something fundamental about how Auto Loader is designed for large initial loads?
Is all this overhead expected, and should I always use batch for historical loads and reserve Auto Loader only for incremental/real-time workflows?
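For concreteness, the split I have in mind looks roughly like this; cloudFiles.includeExistingFiles=false is my assumption for how to make the stream ignore the files already loaded by the batch step (not verified at this scale):

# 1) One-time batch load of the ~250k historical files
(
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .write.mode("overwrite")
    .saveAsTable(some_table)
)

# 2) Auto Loader for new files only; assumes includeExistingFiles=false makes the
#    first streaming run skip everything already under the path
incremental = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("cloudFiles.includeExistingFiles", "false")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
incremental.awaitTermination()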
Thanks in advance for your insights!