<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Figure out stale tables/folders being loaded by auto-loader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/133601#M49890</link>
    <description>&lt;P&gt;At your scale (20k folders, 30-min cadence), &lt;STRONG data-start="63" data-end="89"&gt;file notification mode for the autoloader&lt;/STRONG&gt; is the recommended approach. Directory listing will keep doing huge LISTs, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most workloads.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please check this documentation:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode&lt;/A&gt;&lt;/P&gt;
</description>
    <pubDate>Fri, 03 Oct 2025 05:27:35 GMT</pubDate>
    <dc:creator>Krishna_S</dc:creator>
    <dc:date>2025-10-03T05:27:35Z</dc:date>
    <item>
      <title>Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131708#M49202</link>
      <description>&lt;P&gt;Hello all&lt;/P&gt;&lt;P&gt;We have a pipeline that uses Auto Loader to load data from cloud object storage (ADLS) into a Delta table. We use directory listing mode at the moment, and around 20,000 folders in ADLS have to be checked every 30 minutes for new data to process into a Delta table.&lt;/P&gt;&lt;P&gt;We have noticed that this approach does not process the files of some tables (i.e. folders), resulting in stale tables in the lakehouse.&lt;/P&gt;&lt;P&gt;Is there a way to query RocksDB to find out that files arrived today for, say, 8,000 of the 20,000 tables, then check the last modified date of each table on the Delta side, compare both sides, and figure out which tables are stale?&lt;/P&gt;&lt;P&gt;Or is there a better, more foolproof approach?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 19:56:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131708#M49202</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-11T19:56:49Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131711#M49203</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can try the cloud_files_state function. It provides a SQL API for inspecting the state of a stream, so&amp;nbsp;&lt;SPAN&gt;you can find metadata about files that have been discovered by an&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;Auto Loader&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;stream:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;SELECT * FROM cloud_files_state('path/to/checkpoint');&lt;/LI-CODE&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/production#querying-files-discovered-by-auto-loader" target="_blank"&gt;Configure Auto Loader for production workloads | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 20:07:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131711#M49203</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T20:07:47Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131713#M49204</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;My target table is not a stream but a regular Delta table. I got this error:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;[STREAMING_TABLE_OPERATION_NOT_ALLOWED.NON_STREAMING_TABLE] The operation CLOUD_FILES_STATE is not allowed: `catalog_name`.`schema_name`.`table_name` is not a Streaming Table. SQLSTATE: 42601&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 11 Sep 2025 20:22:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131713#M49204</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-11T20:22:08Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131715#M49205</link>
      <description>&lt;P&gt;Weird, you're using Auto Loader. Under the hood it uses Spark Structured Streaming, so it should work. Did you provide the correct path to the checkpoint location?&lt;BR /&gt;&lt;BR /&gt;Anyway, tomorrow I'll try to run it in my environment. I have a similar setup at one of my clients (also Auto Loader with directory listing mode) and I'm quite sure that this function worked &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; I'll keep you updated.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 20:26:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131715#M49205</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T20:26:30Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131717#M49207</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp; Ah sorry, let me rephrase. I initially ran the command on the Delta table directly, which resulted in the error. Then I ran it on the checkpoint. It did give me results, though "discovered on" was null for all the rows. Still, this does not solve the problem for me: a new file arrived yesterday in cloud object storage (the source is raw parquet, not Delta) for the Delta table at hand (the target is Delta), but cloud_files_state does not tell me that this file arrived or was discovered. So it seems that, since we run this command on the target/Delta side, it only tells us what was processed? In my case, it seems Auto Loader does not discover the file.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 20:56:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/131717#M49207</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-11T20:56:12Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/133601#M49890</link>
      <description>&lt;P&gt;At your scale (20k folders, 30-min cadence), &lt;STRONG data-start="63" data-end="89"&gt;file notification mode for the autoloader&lt;/STRONG&gt; is the recommended approach. Directory listing will keep doing huge LISTs, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most workloads.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please check this documentation:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode&lt;/A&gt;&lt;/P&gt;
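&lt;P&gt;As a rough illustration of the switch, here is a minimal sketch of the reader options involved, assuming Python and a raw parquet source as described earlier in the thread. The helper name notification_mode_options and the schema path are hypothetical, and on ADLS notification mode needs further authentication and resource options that should be covered in the documentation linked above.&lt;/P&gt;

```python
# Minimal sketch: build Auto Loader reader options for file notification mode.
# The helper name and the example paths are hypothetical, not from the thread.

def notification_mode_options(source_format, schema_location):
    """Return reader options that switch Auto Loader to file notification mode."""
    return {
        "cloudFiles.format": source_format,            # e.g. "parquet" for raw parquet sources
        "cloudFiles.useNotifications": "true",         # when unset, directory listing is used
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }


opts = notification_mode_options("parquet", "/tmp/example/_schemas")  # hypothetical path

# On a Databricks cluster (not runnable elsewhere), the options would be passed as:
# df = spark.readStream.format("cloudFiles").options(**opts).load("path/to/source")
print(sorted(opts))
```

&lt;P&gt;Keeping the options in one helper makes it easier to apply the same settings to every folder's stream and to switch back to directory listing while testing.&lt;/P&gt;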
</description>
      <pubDate>Fri, 03 Oct 2025 05:27:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/133601#M49890</guid>
      <dc:creator>Krishna_S</dc:creator>
      <dc:date>2025-10-03T05:27:35Z</dc:date>
    </item>
    <item>
      <title>Re: Figure out stale tables/folders being loaded by auto-loader</title>
      <link>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/133652#M49895</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/36904"&gt;@Krishna_S&lt;/a&gt;&amp;nbsp;I didn't know about &lt;EM&gt;file detection modes&lt;/EM&gt;, that's very cool! &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;&amp;nbsp;according to the documentation, there is a section about RocksDB:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/#how-does-auto-loader-track-ingestion-progress" target="_blank"&gt;https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/#how-does-auto-loader-track-ingestion-progress&lt;/A&gt;. Much to&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;'s point above, the docs indicate that there should be a checkpoint location.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_0-1759486968857.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20382iF77C6D042BE588F7/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_0-1759486968857.png" alt="BS_THE_ANALYST_0-1759486968857.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;This is a really interesting post&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;. I'm certainly following/bookmarking this one &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Fri, 03 Oct 2025 10:25:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/figure-out-stale-tables-folders-being-loaded-by-auto-loader/m-p/133652#M49895</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-10-03T10:25:39Z</dc:date>
    </item>
  </channel>
</rss>

