<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/157832#M54626</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/94123"&gt;@johschmidt42&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;This is a great question. The answer lies in the very first line of your read configuration: spark_session.readStream.format(source="delta")&lt;/P&gt;&lt;P&gt;Because you are using .format("delta") instead of .format("cloudFiles"), you are actually using &lt;STRONG&gt;native Delta Structured Streaming&lt;/STRONG&gt;, not Auto Loader!&lt;/P&gt;&lt;P&gt;Here is why you saw that behavior:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why the cloudFiles.* options were ignored:&lt;/STRONG&gt; Spark silently ignores options that don't apply to the chosen format. The Delta source does not recognize cloudFiles.maxFilesPerTrigger, so it fell back to its own default of 1000 files per micro-batch, which is the number you saw.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why maxFilesPerTrigger worked:&lt;/STRONG&gt; That is the native option for rate-limiting a standard Delta stream.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;The good news? You accidentally did it the right way!&lt;/STRONG&gt; Since your source data is already in Delta format, native Delta streaming (.format("delta")) is the better fit: it reads the table's transaction log, whereas Auto Loader is meant for discovering raw files like CSV/JSON.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: To clean up your code, you can safely remove the cloudFiles options entirely. Here is the idiomatic way to write it:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df: DataFrame = (
    spark_session.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 10)
    .load(table_path)
    .select("*", col("_metadata.file_path").alias("source_file"))
)

(
    df.writeStream
    .trigger(availableNow=True)
    .foreachBatch(process_batch)
    .start()
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: Reading raw files using Auto Loader&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df: DataFrame = (
    spark_session.readStream
    .format("cloudFiles") # This invokes Auto Loader
    .option("cloudFiles.format", "parquet") # Must be a raw file format
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("cloudFiles.maxFilesPerTrigger", 10) # Now this works!
    .load(raw_files_path)
    .select("*", col("_metadata.file_path").alias("source_file"))
)&lt;/LI-CODE&gt;</description>
    <pubDate>Thu, 28 May 2026 22:56:36 GMT</pubDate>
    <dc:creator>ShamenParis</dc:creator>
    <dc:date>2026-05-28T22:56:36Z</dc:date>
    <item>
      <title>Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)?</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/112798#M44333</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm using the &lt;STRONG&gt;Auto Loader&lt;/STRONG&gt; feature to read streaming data from Delta Lake files and process it in batches. The trigger is set to &lt;STRONG&gt;availableNow&lt;/STRONG&gt; to include all new data since the checkpoint offset, but I limit the number of Delta files per batch to 10 using the &lt;STRONG&gt;cloudFiles.maxFilesPerTrigger&lt;/STRONG&gt; option. However, the `process_batch` function always reports that it receives the default 1000 files per batch. Am I misinterpreting the options here?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col


def process_batch(df: DataFrame, batch_id: int) -&amp;gt; None:
    num_files: int = df.select("source_file").distinct().count()
    num_rows_total: int = df.count()

    print(
        f"Batch: '{batch_id}' - Processing {num_files:,} delta files with {num_rows_total:,} rows."
    )


spark_session: SparkSession = SparkSession.getActiveSession()

checkpoint_path: str = "/Volumes/checkpoint_path"
table_path: str = "/Volumes/table_path"

df: DataFrame = (
    spark_session.readStream.format(source="delta")
    .option(key="cloudFiles.format", value="delta")
    .option(key="cloudFiles.schemaLocation", value=checkpoint_path)
    .option(key="cloudFiles.maxFilesPerTrigger", value=10)
    .load(path=table_path)
    .select("*", col("_metadata.file_path").alias("source_file"))
)

df.writeStream.trigger(
    availableNow=True
).foreachBatch(func=process_batch).start()&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 17 Mar 2025 12:26:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/112798#M44333</guid>
      <dc:creator>johschmidt42</dc:creator>
      <dc:date>2025-03-17T12:26:23Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)?</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/112846#M44349</link>
      <description>&lt;P&gt;It works when I change "cloudFiles.maxFilesPerTrigger" to "maxFilesPerTrigger", but this is unexpected.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Mar 2025 20:55:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/112846#M44349</guid>
      <dc:creator>johschmidt42</dc:creator>
      <dc:date>2025-03-17T20:55:36Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)?</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/113687#M44609</link>
      <description>&lt;P&gt;In the docs, the option is named "cloudFiles.maxFilesPerTrigger" &lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 15:40:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/113687#M44609</guid>
      <dc:creator>p_romm</dc:creator>
      <dc:date>2025-03-26T15:40:33Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)?</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/157823#M54622</link>
      <description>&lt;P&gt;For others who run into this issue:&lt;/P&gt;&lt;P&gt;Changing `cloudFiles.maxFilesPerTrigger` to `maxFilesPerTrigger` is not the solution. Check your checkpoint state first.&lt;/P&gt;&lt;P&gt;If a previous Auto Loader run failed or was cancelled after files had already been discovered/planned, the checkpoint can retain that state. On a later run, changing `cloudFiles.maxFilesPerTrigger` may appear to be ignored because the stream is still working through the files that were already discovered under the previous configuration.&lt;/P&gt;&lt;P&gt;For example, if a run used the default/high value for `cloudFiles.maxFilesPerTrigger`, and then failed inside `foreachBatch` or was cancelled, the checkpoint may already contain a planned batch of files. If you later change the configuration to:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;.option("cloudFiles.maxFilesPerTrigger", "1")&lt;/LI-CODE&gt;&lt;P&gt;the next run may still process the previously discovered files as one larger batch, making it look like `cloudFiles.maxFilesPerTrigger` is not being respected.&lt;/P&gt;&lt;P&gt;The cleanest fix is to use a new checkpoint.&lt;/P&gt;&lt;P&gt;If using a new checkpoint is not an option because it would cause a large historical reprocess, another workaround is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Move the affected files out of the Auto Loader source path.&lt;/LI&gt;&lt;LI&gt;Rename them so they will be treated as new files later.&lt;/LI&gt;&lt;LI&gt;Run the stream once with the existing checkpoint. It should complete without processing those files.&lt;/LI&gt;&lt;LI&gt;Move the renamed files back into the source path.&lt;/LI&gt;&lt;LI&gt;Run the stream again with the desired setting, for example:&lt;/LI&gt;&lt;/OL&gt;&lt;LI-CODE lang="python"&gt;.option("cloudFiles.maxFilesPerTrigger", "1")&lt;/LI-CODE&gt;&lt;P&gt;At that point, Auto Loader should discover the renamed files as new files and respect the configured rate limit.&lt;/P&gt;</description>
      <pubDate>Thu, 28 May 2026 19:03:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/157823#M54622</guid>
      <dc:creator>Juan</dc:creator>
      <dc:date>2026-05-28T19:03:01Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader cloudFiles.maxFilesPerTrigger ignored with .trigger(availableNow=True)?</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/157832#M54626</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/94123"&gt;@johschmidt42&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;This is a great question. The answer lies in the very first line of your read configuration: spark_session.readStream.format(source="delta")&lt;/P&gt;&lt;P&gt;Because you are using .format("delta") instead of .format("cloudFiles"), you are actually using &lt;STRONG&gt;native Delta Structured Streaming&lt;/STRONG&gt;, not Auto Loader!&lt;/P&gt;&lt;P&gt;Here is why you saw that behavior:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why the cloudFiles.* options were ignored:&lt;/STRONG&gt; Spark silently ignores options that don't apply to the chosen format. The Delta source does not recognize cloudFiles.maxFilesPerTrigger, so it fell back to its own default of 1000 files per micro-batch, which is the number you saw.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why maxFilesPerTrigger worked:&lt;/STRONG&gt; That is the native option for rate-limiting a standard Delta stream.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;The good news? You accidentally did it the right way!&lt;/STRONG&gt; Since your source data is already in Delta format, native Delta streaming (.format("delta")) is the better fit: it reads the table's transaction log, whereas Auto Loader is meant for discovering raw files like CSV/JSON.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;: To clean up your code, you can safely remove the cloudFiles options entirely. Here is the idiomatic way to write it:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df: DataFrame = (
    spark_session.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 10)
    .load(table_path)
    .select("*", col("_metadata.file_path").alias("source_file"))
)
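
# Hedged aside, not part of the original reply: a minimal pure-Python sketch of
# the mismatch described above. find_mismatched_options is a hypothetical helper,
# not a Spark or Databricks API. It assumes that "cloudFiles."-prefixed options
# only take effect when the reader format is "cloudFiles".
def find_mismatched_options(source_format, options):
    """Return option keys that the chosen source format would not apply."""
    if source_format == "cloudFiles":
        return []
    return sorted(key for key in options if key.startswith("cloudFiles."))


# The prefixed options from the question, checked against the "delta" format.
ignored_options = find_mismatched_options(
    "delta",
    {"cloudFiles.format": "delta", "cloudFiles.maxFilesPerTrigger": 10},
)
print(ignored_options)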

(
    df.writeStream
    .trigger(availableNow=True)
    .foreachBatch(process_batch)
    .start()
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;: Reading raw files using Auto Loader&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df: DataFrame = (
    spark_session.readStream
    .format("cloudFiles") # This invokes Auto Loader
    .option("cloudFiles.format", "parquet") # Must be a raw file format
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("cloudFiles.maxFilesPerTrigger", 10) # Now this works!
    .load(raw_files_path)
    .select("*", col("_metadata.file_path").alias("source_file"))
)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 28 May 2026 22:56:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-cloudfiles-maxfilespertrigger-ignored-with-trigger/m-p/157832#M54626</guid>
      <dc:creator>ShamenParis</dc:creator>
      <dc:date>2026-05-28T22:56:36Z</dc:date>
    </item>
  </channel>
</rss>

