Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

yit — Tue, 02 Sep 2025 10:56:02 GMT

Hey everyone,

I’m trying to clear up some confusion about trigger batches and micro-batches in Auto Loader when using .foreachBatch.

Here’s what I understand so far:

Trigger batch – Controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger. This determines how many new files Auto Loader reads per streaming trigger.
Micro-batch in .foreachBatch – This is the batch of data your callback function receives.

My questions are:

1. Are trigger batches and .foreachBatch micro-batches exactly the same thing?

2. If they are not the same, do they map one to one? For example, if I set cloudFiles.maxFilesPerTrigger=10, does each .foreachBatch call always receive exactly 10 files (if available), or could it receive more or fewer depending on internal Spark scheduling?

3. Can I set the .foreachBatch micro-batch size, just as I set the trigger size, or is it an internal Spark configuration?

4. Does the trigger type (availableNow, time-scheduled trigger, real-time streaming) affect any of the answers above?

5. Any suggestions to keep in mind for the initial (historical) load?

Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

szymon_dybczak — Tue, 02 Sep 2025 11:05:52 GMT

Hi @yit ,

1. They are not quite the same. A trigger batch defines how many new files Auto Loader lists for ingestion per streaming trigger (as you correctly pointed out, this is controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger).

2. A micro-batch is the unit of data that the query executes on. If you use .foreachBatch, Spark passes your function one micro-batch at a time.

So you can think of it in the following way: a "trigger batch" of files produces the input data that Spark turns into a micro-batch.

As an example, if you set cloudFiles.maxFilesPerTrigger=10, Spark will list at most 10 new files in that trigger, and that set of files becomes the input for one micro-batch.

But keep in mind that if fewer than 10 files are available, your micro-batch will be smaller.
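
To make that mapping concrete, here is a toy model in plain Python. It is not Spark code and not how Spark's scheduler is implemented; `run_triggers`, the file names, and the backlog of 25 files are all made up for illustration. Each simulated trigger takes at most 10 files and hands them to the callback as one micro-batch, so the last micro-batch comes out smaller:

```python
from typing import Callable, List


def run_triggers(
    backlog: List[str],
    max_files_per_trigger: int,
    foreach_batch: Callable[[List[str], int], None],
) -> int:
    """Toy model: one trigger -> one micro-batch of at most N files.

    Returns the number of micro-batches that were produced.
    """
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be >= 1")

    remaining = list(backlog)
    batch_id = 0
    while remaining:
        # Take up to the cap; fewer if the backlog is nearly empty.
        files = remaining[:max_files_per_trigger]
        remaining = remaining[max_files_per_trigger:]
        foreach_batch(files, batch_id)  # one callback call per trigger
        batch_id += 1
    return batch_id


# Hypothetical backlog of 25 files with a cap of 10 files per trigger.
sizes: List[int] = []
batches = run_triggers(
    [f"file_{i}.json" for i in range(25)],
    max_files_per_trigger=10,
    foreach_batch=lambda files, batch_id: sizes.append(len(files)),
)
print(sizes)    # [10, 10, 5]
print(batches)  # 3
```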

As for controlling the size of the micro-batch: you do that by setting cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger. Remember, these settings determine the input that the micro-batch operates on, so they directly influence its size.

See: Configure Structured Streaming batch size on Azure Databricks (Microsoft Learn)
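
A minimal sketch of how these options might be wired up in PySpark on Databricks. Assumptions: `spark` is the notebook's session, and the paths, the JSON format, the limit values, and the names `AUTO_LOADER_OPTIONS`, `process_batch`, and `start_ingest` are placeholders rather than anything from this thread. The trigger shown is just one possible choice. Note that in PySpark the method is spelled `foreachBatch`.

```python
# Auto Loader options; all values and paths are hypothetical placeholders.
AUTO_LOADER_OPTIONS = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "/tmp/demo/_schema",
    # At most 10 new files per micro-batch.
    "cloudFiles.maxFilesPerTrigger": 10,
    # Byte cap per micro-batch (documented as a soft maximum).
    "cloudFiles.maxBytesPerTrigger": "1g",
}


def process_batch(batch_df, batch_id):
    """Callback that Spark invokes once per micro-batch."""
    rows = batch_df.count()
    print(f"micro-batch {batch_id}: {rows} rows")
    # Spark ignores the return value; it is returned only so the
    # function is easy to exercise outside a streaming query.
    return rows


def start_ingest(spark, source_path, checkpoint_path):
    """Start an Auto Loader stream that feeds process_batch."""
    stream_df = (
        spark.readStream.format("cloudFiles")
        .options(**AUTO_LOADER_OPTIONS)
        .load(source_path)
    )
    return (
        stream_df.writeStream
        .foreachBatch(process_batch)
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # one possible trigger choice
        .start()
    )
```

In a notebook this would be started with something like `start_ingest(spark, "/path/to/landing", "/path/to/checkpoint")` (both paths hypothetical). Each call to `process_batch` then corresponds to one micro-batch, whose input is bounded by the two limits above.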

Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

yit — Tue, 02 Sep 2025 11:50:14 GMT

@szymon_dybczak So, is their relation one-to-one? Does one trigger batch always map to one micro-batch?

Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

szymon_dybczak — Tue, 02 Sep 2025 12:04:22 GMT

Yes, you can think of it that way.