Auto Loader: Trigger batch vs. micro-batch (as in .foreachBatch)

yit
New Contributor III

Hey everyone,

I’m trying to clear up some confusion about Auto Loader regarding trigger batches and micro-batches when using .foreachBatch.

Here’s what I understand so far:

  1. Trigger batch – Controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger. These options determine how many new files Auto Loader reads per streaming trigger.

  2. Micro-batch in .foreachBatch – This is the batch of data your callback function receives (see the sketch just below).
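
To make it concrete, here's a minimal sketch of the kind of setup I mean (the format, paths and table name are just placeholders, and spark is the session a Databricks notebook provides):

# Minimal Auto Loader sketch - placeholder format, paths and table name
from pyspark.sql import DataFrame

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    # batch_df is the micro-batch Spark hands to this callback
    batch_df.write.mode("append").saveAsTable("my_schema.bronze_table")

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 10)      # caps the files listed per trigger
    .option("cloudFiles.maxBytesPerTrigger", "1g")    # soft cap on bytes per trigger
    .load("/Volumes/main/my_schema/landing/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/my_schema/_checkpoints/bronze_table")
    .foreachBatch(write_batch)
    .start())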

My questions are:

1. Are trigger batches and .foreachBatch micro-batches exactly the same thing?

2. If they are not the same, do they map one-to-one? For example, if I have maxFilesPerTrigger=10, does each .foreachBatch call always receive exactly 10 files (if available), or could it receive more or fewer depending on internal Spark scheduling?

3. Can I set the .foreachBatch micro-batch size, just as I set the trigger size, or is it an internal Spark configuration?

4. Does the trigger type affect any of the answers above (availableNow, time-based trigger, real-time streaming)?

5. Any suggestions to keep in mind for the initial (historical) load?

ACCEPTED SOLUTION

szymon_dybczak
Esteemed Contributor III

Hi @yit ,

1. They are not quite the same. A trigger batch defines how many new files Auto Loader lists for ingestion per streaming trigger (this is controlled, as you correctly pointed out, by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger).

2. A micro-batch is the unit of data that the query executes on. If you use .foreachBatch, Spark passes your function one micro-batch at a time.

So you can think of it in the following way: a "trigger batch" of files produces the input data that Spark will turn into a micro-batch.

For example, if you set maxFilesPerTrigger=10, then Spark will list at most 10 new files in that trigger, and that set of files will become the input for one micro-batch.

But keep in mind that if fewer than 10 files are available, your micro-batch will be smaller.

As for controlling the size of the micro-batch: you do that precisely by setting maxFilesPerTrigger and maxBytesPerTrigger. Remember, these settings determine the input that the micro-batch will operate on, so they directly influence the size of the micro-batch.
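
If you want to observe this yourself, a sketch along the following lines (placeholder paths and table name; it assumes the _metadata.file_path column is available for the cloudFiles source on your runtime) logs how many source files and rows each foreachBatch call receives:

from pyspark.sql import DataFrame, functions as F

def log_and_write(batch_df: DataFrame, batch_id: int) -> None:
    # Count the distinct source files and the rows in this micro-batch
    stats = batch_df.agg(
        F.countDistinct("source_file").alias("n_files"),
        F.count(F.lit(1)).alias("n_rows"),
    ).first()
    print(f"micro-batch {batch_id}: {stats['n_files']} files, {stats['n_rows']} rows")
    batch_df.drop("source_file").write.mode("append").saveAsTable("my_schema.bronze_table")

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 10)   # bounds the trigger batch, and therefore the micro-batch
    .load("/Volumes/main/my_schema/landing/")
    .withColumn("source_file", F.col("_metadata.file_path"))   # file each row came from
    .writeStream
    .option("checkpointLocation", "/Volumes/main/my_schema/_checkpoints/bronze_table")
    .foreachBatch(log_and_write)
    .start())

With maxFilesPerTrigger=10 you should see at most 10 files per micro-batch, and fewer whenever fewer new files have arrived.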

Configure Structured Streaming batch size on Azure Databricks - Azure Databricks | Microsoft Learn


yit
New Contributor III

@szymon_dybczak So their relation is one-to-one? Does one trigger batch always map to one micro-batch?

szymon_dybczak
Esteemed Contributor III

Yes, you can think of it that way.
