<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130494#M48808</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;1. They are not quite the same. A trigger batch is the set of new files Auto Loader lists for ingestion in one streaming trigger; its size is capped, as you correctly pointed out, by &lt;SPAN&gt;cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2. A micro-batch is the unit of data that the query executes on. If you use .foreachBatch, Spark passes your function one micro-batch at a time.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So you can think of it this way: a "trigger batch" of files produces the input data that Spark turns into a micro-batch.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;For example, if you set cloudFiles.maxFilesPerTrigger=10, Auto Loader will list &lt;STRONG&gt;at most&lt;/STRONG&gt; 10 new files in that trigger, and that set of files becomes the input for one micro-batch.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Keep in mind that if fewer than 10 files are available, your micro-batch will be smaller.&lt;/P&gt;&lt;P&gt;As for controlling the size of the micro-batch: you do that by setting maxFilesPerTrigger and maxBytesPerTrigger. Remember, these settings determine the input that the micro-batch operates on, so they directly influence its size.&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/batch-size" target="_blank" rel="noopener"&gt;Configure Structured Streaming batch size on Azure Databricks - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 02 Sep 2025 11:05:52 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-09-02T11:05:52Z</dc:date>
    <item>
      <title>Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130484#M48807</link>
      <description>&lt;P&gt;Hey everyone,&lt;/P&gt;&lt;P&gt;I’m trying to clear up my confusion about &lt;STRONG&gt;trigger batches&lt;/STRONG&gt; and &lt;STRONG&gt;micro-batches&lt;/STRONG&gt; in Auto Loader when using .foreachBatch.&lt;/P&gt;&lt;P&gt;Here’s what I understand so far:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Trigger batch&lt;/STRONG&gt; – Controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger. This determines how many new files Auto Loader reads &lt;STRONG&gt;per streaming trigger&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Micro-batch in .foreachBatch&lt;/STRONG&gt; – This is the batch of data your callback function receives.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;Are &lt;STRONG&gt;trigger batches and .foreachBatch micro-batches exactly the same thing&lt;/STRONG&gt;?&lt;/P&gt;&lt;P&gt;2. If they are not the same, do they map one-to-one? For example, if I have maxFilesPerTrigger=10, does each .foreachBatch call &lt;STRONG&gt;always&lt;/STRONG&gt; receive exactly 10 files (if available), or could it receive more or fewer depending on internal Spark scheduling?&lt;/P&gt;&lt;P&gt;3. Can I set the .foreachBatch micro-batch size, just as I set the trigger size, or is it an internal Spark configuration?&lt;/P&gt;&lt;P&gt;4. Does the trigger type (availableNow, time-scheduled trigger, real-time streaming) affect any of the answers above?&lt;/P&gt;&lt;P&gt;5. Any suggestions to keep in mind for the initial (historical) load?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 10:56:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130484#M48807</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-09-02T10:56:02Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130494#M48808</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;1. They are not quite the same. A trigger batch is the set of new files Auto Loader lists for ingestion in one streaming trigger; its size is capped, as you correctly pointed out, by &lt;SPAN&gt;cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2. A micro-batch is the unit of data that the query executes on. If you use .foreachBatch, Spark passes your function one micro-batch at a time.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So you can think of it this way: a "trigger batch" of files produces the input data that Spark turns into a micro-batch.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;For example, if you set cloudFiles.maxFilesPerTrigger=10, Auto Loader will list &lt;STRONG&gt;at most&lt;/STRONG&gt; 10 new files in that trigger, and that set of files becomes the input for one micro-batch.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Keep in mind that if fewer than 10 files are available, your micro-batch will be smaller.&lt;/P&gt;&lt;P&gt;As for controlling the size of the micro-batch: you do that by setting maxFilesPerTrigger and maxBytesPerTrigger. Remember, these settings determine the input that the micro-batch operates on, so they directly influence its size.&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/batch-size" target="_blank" rel="noopener"&gt;Configure Structured Streaming batch size on Azure Databricks - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 11:05:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130494#M48808</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-02T11:05:52Z</dc:date>
    </item>
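For readers who want to see the settings from the reply above in context, here is a minimal PySpark sketch, assuming a Databricks runtime where the cloudFiles source is available. The function names start_autoloader_stream and process_batch, the JSON format, the 1g byte limit, and the availableNow trigger are illustrative assumptions, not details from the thread.

```python
def process_batch(batch_df, batch_id):
    # Hypothetical callback: batch_df holds the rows read from the files
    # admitted for this micro-batch, batch_id is its sequence number.
    print(f"micro-batch {batch_id}: {batch_df.count()} rows")


def start_autoloader_stream(spark, source_path, checkpoint_path, on_batch):
    """Start an Auto Loader stream that passes each micro-batch to on_batch.

    The file format, paths, and limits are assumptions for illustration.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")                    # assumed source format
        .option("cloudFiles.schemaLocation", checkpoint_path)   # where the inferred schema is stored
        .option("cloudFiles.maxFilesPerTrigger", 10)            # at most 10 new files per micro-batch
        .option("cloudFiles.maxBytesPerTrigger", "1g")          # soft byte limit per micro-batch
        .load(source_path)
    )
    return (
        stream.writeStream
        .foreachBatch(on_batch)                                 # invoked once per micro-batch
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                             # assumed trigger: drain the backlog, then stop
        .start()
    )
```

On a cluster this could be called as start_autoloader_stream(spark, source_path, checkpoint_path, process_batch) with your own paths; each invocation of process_batch then corresponds to one micro-batch built from the files admitted under the two limits.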
    <item>
      <title>Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130506#M48812</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;So, is their relationship one-to-one? Does one trigger batch &lt;STRONG&gt;always&lt;/STRONG&gt; map to one micro-batch?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 11:50:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130506#M48812</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-09-02T11:50:14Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130510#M48814</link>
      <description>&lt;P&gt;Yes, you can think of it that way.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 12:04:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-trigger-batch-vs-micro-batch-as-in-foreachbatch/m-p/130510#M48814</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-02T12:04:22Z</dc:date>
    </item>
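The one-to-one relationship discussed in this thread can be illustrated with a small pure-Python model. This is a simplified sketch, not Auto Loader's actual implementation: plan_micro_batches is a hypothetical helper, file sizes are plain integers, and the byte limit is modeled as a soft maximum (a batch closes once the running total reaches or exceeds it), which is how the Databricks documentation describes cloudFiles.maxBytesPerTrigger.

```python
def plan_micro_batches(file_sizes, max_files=None, max_bytes=None):
    """Group pending files into micro-batches, one group per trigger.

    Simplified model: a batch closes as soon as either limit is reached.
    The byte limit is soft, so the file that crosses it is still included.
    """
    batches, current, current_bytes = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_bytes += size
        hit_files = max_files is not None and len(current) >= max_files
        hit_bytes = max_bytes is not None and current_bytes >= max_bytes
        if hit_files or hit_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        # Fewer files than the limit were left, so the last batch is smaller.
        batches.append(current)
    return batches


# 25 pending files with a limit of 10 files per trigger.
by_count = plan_micro_batches([1] * 25, max_files=10)
print([len(batch) for batch in by_count])  # [10, 10, 5]

# Eight files of 3 units each with a soft byte limit of 10 units.
by_bytes = plan_micro_batches([3] * 8, max_bytes=10)
print([sum(batch) for batch in by_bytes])  # [12, 12]
```

Each inner list stands for one trigger batch and therefore one .foreachBatch call; the final batch in the first example is smaller because only 5 files remained.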
  </channel>
</rss>

