11-04-2025 07:09 AM - edited 11-04-2025 07:15 AM
Hello @SahiSammu ,
- Reading the data directly lists all objects under source_s3_path_default once and creates a single logical DataFrame covering the ~250k files. Writing that DataFrame to the target Delta table then produces a single commit.
- Auto Loader ingests files in micro-batches. By default it picks up ~1,000 files per micro-batch, so 250k files ≈ 250 micro-batches, and each micro-batch involves:
1. Listing/discovering candidate files
2. Filtering out files it has already seen (based on its stored state)
3. Planning and executing a Spark job, then committing a Delta transaction
So if each micro-batch takes even 30 seconds to process, the total comes to 30 sec × 250 batches = 7,500 sec, or roughly 2 hours.
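To make the arithmetic above easy to replay with your own numbers, here is a minimal sketch in plain Python. The function name estimate_autoloader_time is made up for illustration, and the 30 seconds per micro-batch is the same rough assumption used above, not a measured value.

```python
import math


def estimate_autoloader_time(total_files, files_per_batch=1000, seconds_per_batch=30.0):
    """Rough estimate only: number of micro-batches and total wall-clock seconds.

    Assumes every micro-batch takes the same fixed time, which is a simplification.
    """
    batches = math.ceil(total_files / files_per_batch)
    total_seconds = batches * seconds_per_batch
    return batches, total_seconds


batches, seconds = estimate_autoloader_time(250_000)
print(f"{batches} micro-batches, about {seconds / 3600:.1f} hours")
# prints: 250 micro-batches, about 2.1 hours
```

Keep in mind that the per-batch time will likely grow if you raise the number of files per batch, so treat the result as a ballpark figure rather than a prediction.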
So to answer your question, which one to use depends on the workload:
- One-off, large historical backfill
→ Prefer batch (or COPY INTO) for speed and simplicity.
- Ongoing ingestion / new files / exactly-once semantics
→ Use Auto Loader with a tuned maxFilesPerTrigger.
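Both options could look roughly like the PySpark sketch below. It is only an illustration under assumptions: the helper names, the file format, the paths, the table name, and the value 10000 are all hypothetical, and `spark` is the session a Databricks notebook already provides. The availableNow trigger is my choice here so the stream processes what is currently available and then stops.

```python
def autoloader_options(file_format, schema_location, max_files_per_trigger):
    """Build Auto Loader reader options as a plain dict (values are assumptions)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Default is 1000; a larger value means fewer, bigger micro-batches.
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
    }


def batch_backfill(spark, source_path, target_table, file_format):
    """One-off backfill: a single read and a single append to the Delta table."""
    df = spark.read.format(file_format).load(source_path)
    df.write.format("delta").mode("append").saveAsTable(target_table)


def autoloader_ingest(spark, source_path, target_table, checkpoint_path, options):
    """Ongoing ingestion with Auto Loader; returns the streaming query handle."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process the current backlog, then stop
        .toTable(target_table)
    )


# Hypothetical usage in a notebook (paths and names are placeholders):
# opts = autoloader_options("parquet", "/tmp/example/_schema", 10000)
# autoloader_ingest(spark, source_s3_path_default, "my_catalog.my_schema.target",
#                   "/tmp/example/_checkpoint", opts)
```

Start with a moderate maxFilesPerTrigger and raise it while watching the batch duration in the streaming query progress, since the right value depends on file sizes and cluster capacity.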
Anudeep