Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Auto Loader vs Batch for Large File Loads

SahiSammu
New Contributor II

Hi everyone,

I'm seeing a dramatic difference in processing times between batch and streaming (Auto Loader) approaches for reading about 250,000 files from S3 in Databricks. My goal is to read metadata from these files and register it as a table (and eventually use the Auto Loader backup option). Here's the comparison:

Batch approach (2 minutes for 250k files):

df = (
  spark.read.format("binaryFile")
  .option("recursiveFileLookup", "true")
  .load(source_s3_path_default)
  .select("path", "modificationTime", "length")
)
df.write.saveAsTable("some_table")


Auto Loader streaming approach (2.5 hours for 250k files):

write_stream = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .load(source_s3_path_default)
  .select("path", "modificationTime", "length")
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", f"{some_checkpoint}")
  .trigger(availableNow=True)
  .table(f"{some_table}")
)
write_stream.awaitTermination()


Why does Auto Loader take so much longer?

  • Same file count and S3 path
  • Same basic selection of columns
  • The only real difference is .read.format("binaryFile") vs .readStream.format("cloudFiles")

Am I missing something fundamental about how Auto Loader handles large initial loads? Is all this overhead expected, and should I always use batch for historical loads and reserve Auto Loader for incremental/real-time workflows only?

Thanks in advance for your insights!

1 ACCEPTED SOLUTION


K_Anudeep
Databricks Employee

Hello @SahiSammu ,

  • The batch read lists all objects under source_s3_path_default once, builds a single DataFrame over the ~250k files, and then writes to the target Delta table in one commit.

  • Auto Loader ingests files in micro-batches. By default it picks up ~1,000 files per micro-batch (cloudFiles.maxFilesPerTrigger), so 250k files ≈ 250 micro-batches, and each micro-batch involves:

    1. Listing/discovering candidate files
    2. Filtering out files already recorded in the checkpoint state
    3. Planning and executing a Spark job, then committing a Delta transaction

  • So even if each micro-batch takes only 30 seconds, the total is 30 sec × 250 batches ≈ 2 hours.

 

So, to answer your question:

  • One-off, large historical backfill
    → Prefer batch (or COPY INTO) for speed and simplicity.
  • Ongoing ingestion / new files / exactly-once semantics
    → Use Auto Loader with a tuned cloudFiles.maxFilesPerTrigger (see the sketch below).

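A minimal sketch of the tuned Auto Loader path for your backfill case (the trigger value of 50,000 is only illustrative and should be sized to your cluster and file sizes; some_checkpoint and some_table are the same placeholders from your post):

write_stream = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.maxFilesPerTrigger", 50000)  # default is 1000; fewer, larger micro-batches
  .load(source_s3_path_default)
  .select("path", "modificationTime", "length")
  .writeStream
  .option("checkpointLocation", f"{some_checkpoint}")
  .trigger(availableNow=True)
  .table(f"{some_table}")
)
write_stream.awaitTermination()

With availableNow=True the stream still processes everything currently under the path and then stops, just in far fewer micro-batches.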
 

Anudeep


2 REPLIES


SahiSammu
New Contributor II

Thank you, Anudeep.

I plan to tune Auto Loader by increasing the maxFilesPerTrigger parameter to optimize performance. My decision to use Auto Loader is primarily driven by its built-in backup functionality via cloudFiles.cleanSource.moveDestination, which eliminates the need to maintain custom code for file cleanup.
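Roughly what I have in mind, sketched below (the backup path is a placeholder, and I still need to confirm the cleanSource options are supported on my runtime):

backup_s3_path = "s3://my-bucket/ingested-backup/"  # placeholder destination for moved files

write_stream = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.maxFilesPerTrigger", 20000)  # raised from the 1000 default
  .option("cloudFiles.cleanSource", "MOVE")  # move source files after ingestion (needs a recent DBR; to verify)
  .option("cloudFiles.cleanSource.moveDestination", backup_s3_path)
  .load(source_s3_path_default)
  .select("path", "modificationTime", "length")
  .writeStream
  .option("checkpointLocation", f"{some_checkpoint}")
  .trigger(availableNow=True)
  .table(f"{some_table}")
)
write_stream.awaitTermination()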

If there is a better option to back up files after ingestion, please feel free to suggest it.

Thank you,
