K_Anudeep
Databricks Employee

Hello @SahiSammu ,

  • Reading the data directly lists all objects under source_s3_path_default once, builds a single logical DataFrame over the ~250k files, and then writes it to the target Delta table in a single commit.

  • Auto Loader ingests files in micro-batches. By default it picks up about 1,000 files per micro-batch, so 250k files ≈ 250 micro-batches, and each micro-batch involves:

    1. Listing/discovering candidate files
    2. Filtering out files already seen (tracked in the stream's state)
    3. Planning and executing a Spark job and then committing a Delta transaction

  • So if each micro-batch takes even 30 seconds to process, the total comes to 30 s × 250 batches = 7,500 s, or roughly 2 hours.
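To play with the numbers above, here is a minimal sketch of the same back-of-the-envelope estimate. The function name is hypothetical, and the 30 seconds per micro-batch is only the illustrative figure used above, not a measured value.

```python
import math


def estimate_backfill_seconds(total_files, files_per_batch=1000, seconds_per_batch=30.0):
    """Rough estimate only: number of micro-batches times the time per micro-batch.

    files_per_batch=1000 mirrors the default batch size mentioned above;
    seconds_per_batch is an assumed, illustrative value.
    """
    batches = math.ceil(total_files / files_per_batch)
    return batches, batches * seconds_per_batch


batches, total_seconds = estimate_backfill_seconds(250_000)
print(f"{batches} micro-batches, ~{total_seconds / 3600:.1f} hours")
# prints: 250 micro-batches, ~2.1 hours
```

Raising files_per_batch shrinks the number of micro-batches, and with it the per-batch overhead, which is why tuning the batch size matters for large backlogs.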

 

So, to answer your question, the right choice depends on the workload:

  • One-off, large historical backfill
    → Prefer batch (or COPY INTO) for speed and simplicity.
  • Ongoing ingestion / new files / exactly-once semantics
    → Use Auto Loader with a tuned cloudFiles.maxFilesPerTrigger.
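As a rough sketch of the two options, assuming Parquet source files, an existing `spark` session, and hypothetical target table and checkpoint names (none of these are known from the thread), the code could look something like the following. The helper function names are made up for illustration, and 5000 is only an example value to tune from.

```python
def auto_loader_options(file_format, schema_location, max_files_per_trigger=1000):
    """Build the Auto Loader reader options as a plain dict (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Upper bound on the number of new files picked up per micro-batch.
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
    }


def backfill_batch(spark, source_path, target_table, file_format="parquet"):
    """One-off backfill: a single batch read and a single Delta write."""
    df = spark.read.format(file_format).load(source_path)
    df.write.format("delta").mode("append").saveAsTable(target_table)


def ingest_incremental(spark, source_path, target_table, checkpoint_path,
                       file_format="parquet", max_files_per_trigger=5000):
    """Ongoing ingestion with Auto Loader and larger micro-batches."""
    # Assumption: the checkpoint location doubles as the schema location.
    opts = auto_loader_options(file_format, checkpoint_path, max_files_per_trigger)
    stream = spark.readStream.format("cloudFiles").options(**opts).load(source_path)
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process the current backlog, then stop
        .toTable(target_table)
    )
```

In a notebook you would call, for example, `backfill_batch(spark, source_s3_path_default, "my_catalog.my_schema.my_table")`, where the table name is a placeholder.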

 

Anudeep
