<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Questions on Auto Loader auto Listing Logic in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152229#M53789</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m investigating some performance patterns in our Auto Loader (S3) pipelines and would like to clarify the internal listing logic.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Context:&lt;/STRONG&gt; We run a batch job every hour using Auto Loader. Recently, after &lt;STRONG&gt;March 10th&lt;/STRONG&gt;, we noticed our execution time jumped from &lt;STRONG&gt;1 minute to over 5 minutes&lt;/STRONG&gt;. I've confirmed from the &lt;STRONG&gt;March 10 release notes&lt;/STRONG&gt; that the default value for useIncrementalListing was changed to false, which explains the sudden performance drop. Explicitly setting it to true resolved this issue.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The Mystery (Periodic Spikes):&lt;/STRONG&gt; However, looking at the data &lt;I&gt;before&lt;/I&gt; March 10th (when auto or true was the default), I noticed a consistent pattern: execution times increased significantly every 8 hours at &lt;STRONG&gt;UTC 04:00, 12:00, and 20:00&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;My Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Does cloudFiles.fullDirectoryScanInterval actually exist?&lt;/STRONG&gt; I’ve heard this option controls the interval for full scans when using useIncrementalListing = "auto". Is this a valid/supported configuration?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Is the default interval 8 hours?&lt;/STRONG&gt; The UTC 04, 12, 20 pattern is too consistent to be a coincidence. I'd like to know if Auto Loader is hard-coded (or defaulted) to perform a "Full Listing" every 8 hours even when in incremental mode.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Internal Logic of "auto":&lt;/STRONG&gt; How exactly does Auto Loader decide when to perform a full vs. incremental scan when set to "auto"? Is it purely time-based, or does it depend on other factors?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;I&gt;P.S. 
I am aware that Databricks recommends File Events for production, but due to cost and the lack of real-time requirements, we prefer the 1-hour batch interval approach.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;Looking forward to your insights!&lt;/P&gt;</description>
    <pubDate>Fri, 27 Mar 2026 01:31:37 GMT</pubDate>
    <dc:creator>JIWON</dc:creator>
    <dc:date>2026-03-27T01:31:37Z</dc:date>
    <item>
      <title>Questions on Auto Loader auto Listing Logic</title>
      <link>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152229#M53789</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m investigating some performance patterns in our Auto Loader (S3) pipelines and would like to clarify the internal listing logic.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Context:&lt;/STRONG&gt; We run a batch job every hour using Auto Loader. Recently, after &lt;STRONG&gt;March 10th&lt;/STRONG&gt;, we noticed our execution time jumped from &lt;STRONG&gt;1 minute to over 5 minutes&lt;/STRONG&gt;. I've confirmed from the &lt;STRONG&gt;March 10 release notes&lt;/STRONG&gt; that the default value for useIncrementalListing was changed to false, which explains the sudden performance drop. Explicitly setting it to true resolved this issue.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The Mystery (Periodic Spikes):&lt;/STRONG&gt; However, looking at the data &lt;I&gt;before&lt;/I&gt; March 10th (when auto or true was the default), I noticed a consistent pattern: execution times increased significantly every 8 hours at &lt;STRONG&gt;UTC 04:00, 12:00, and 20:00&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;My Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Does cloudFiles.fullDirectoryScanInterval actually exist?&lt;/STRONG&gt; I’ve heard this option controls the interval for full scans when using useIncrementalListing = "auto". Is this a valid/supported configuration?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Is the default interval 8 hours?&lt;/STRONG&gt; The UTC 04, 12, 20 pattern is too consistent to be a coincidence. I'd like to know if Auto Loader is hard-coded (or defaulted) to perform a "Full Listing" every 8 hours even when in incremental mode.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Internal Logic of "auto":&lt;/STRONG&gt; How exactly does Auto Loader decide when to perform a full vs. incremental scan when set to "auto"? Is it purely time-based, or does it depend on other factors?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;I&gt;P.S. 
I am aware that Databricks recommends File Events for production, but due to cost and the lack of real-time requirements, we prefer the 1-hour batch interval approach.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;Looking forward to your insights!&lt;/P&gt;</description>
      <pubDate>Fri, 27 Mar 2026 01:31:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152229#M53789</guid>
      <dc:creator>JIWON</dc:creator>
      <dc:date>2026-03-27T01:31:37Z</dc:date>
    </item>
    <item>
      <title>Re: Questions on Auto Loader auto Listing Logic</title>
      <link>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152291#M53806</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/182410"&gt;@JIWON&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;1. No, &lt;CODE&gt;cloudFiles.fullDirectoryScanInterval&lt;/CODE&gt; is not an Auto Loader option.&lt;/P&gt;
&lt;P&gt;2. Assuming the job is triggered every hour, the spikes every 8 hours are explained by the &lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/directory-listing-mode#incremental-listing-deprecated" target="_blank" rel="noopener"&gt;incremental listing documentation&lt;/A&gt;:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;To ensure eventual completeness of data in&amp;nbsp;&lt;CODE&gt;auto&lt;/CODE&gt;&amp;nbsp;mode,&amp;nbsp;Auto Loader&amp;nbsp;automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting&amp;nbsp;&lt;CODE&gt;cloudFiles.backfillInterval&lt;/CODE&gt;&amp;nbsp;to trigger asynchronous backfills at a given interval.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;3. So, to make full scans more or less frequent, set an interval with the &lt;SPAN&gt;&lt;CODE&gt;cloudFiles.backfillInterval&lt;/CODE&gt; option, for example &lt;CODE&gt;.option("cloudFiles.backfillInterval", "1 week")&lt;/CODE&gt;. Bear in mind that the full listing exists to pick up any files the incremental listings missed, so running it less often means some data may potentially be missed.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Hope it helps.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;P.S. I'm curious which of your requirements make File Events mode unsuitable. You would still be able to run the job every hour (rather than in real time) with File Events mode.&lt;/P&gt;
&lt;P&gt;Best regards,&lt;/P&gt;</description>
      <pubDate>Fri, 27 Mar 2026 10:42:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152291#M53806</guid>
      <dc:creator>aleksandra_ch</dc:creator>
      <dc:date>2026-03-27T10:42:34Z</dc:date>
    </item>
    <item>
      <title>Re: Questions on Auto Loader auto Listing Logic</title>
      <link>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152486#M53830</link>
      <description>&lt;P&gt;Hi aleksandra_ch,&lt;/P&gt;&lt;P&gt;Thank you so much for the detailed explanation! I feel a bit embarrassed realizing I hadn't thoroughly checked the documentation before asking.&lt;/P&gt;&lt;P&gt;As you pointed out, since my Auto Loader runs as an hourly batch, the "7 incremental + 1 full listing" logic perfectly explains why I was seeing performance spikes every 8 hours. After discovering that the default for useIncrementalListing was changed to false in the March 10 release, I explicitly set it to true, and the issue has been resolved.&lt;/P&gt;&lt;P&gt;I am aware that using incremental listing alone carries a risk of missing files. However, given that our S3 data is Hive-partitioned (year/month/day/hour) and the filenames themselves include timestamps, the risk seems low—though I agree it's not 100% foolproof.&lt;/P&gt;&lt;P&gt;Also, your P.S. was a real eye-opener! I had always associated "File Events" mode exclusively with real-time streaming, so I hadn't even explored using it for our hourly batches. I'll definitely look into implementing that to see if it provides better stability for our pipeline.&lt;/P&gt;&lt;P&gt;Thank you again for your help and for sharing such great insights.&lt;/P&gt;&lt;P&gt;Best regards, Jiwon&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2026 08:57:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/questions-on-auto-loader-auto-listing-logic/m-p/152486#M53830</guid>
      <dc:creator>JIWON</dc:creator>
      <dc:date>2026-03-30T08:57:46Z</dc:date>
    </item>
  </channel>
</rss>

