3 weeks ago
Hello all
We have a pipeline that uses Auto Loader to load data from cloud object storage (ADLS) into a Delta table. We currently use directory listing mode, and around 20,000 folders in ADLS must be checked every 30 minutes for new data to process into the Delta table.
We have noticed that this approach fails to process files for some tables (i.e., folders), resulting in stale tables in the lakehouse.
Is there a way to query the RocksDB state to learn that files arrived today for, say, 8,000 of the 20,000 tables? We could then profile the last-modified date of each table on the Delta side, compare both sides, and identify the stale tables.
Or is there another, better and more fool-proof approach?
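The comparison step described above can be sketched in plain Python. This is a minimal sketch, assuming we can already extract two mappings built elsewhere: each table's newest source-file arrival time (e.g., from checkpoint state or a storage listing) and each Delta table's last commit time (e.g., from DESCRIBE HISTORY). The function name, inputs, and grace period are hypothetical.

```python
from datetime import datetime, timedelta

def find_stale_tables(source_arrivals, delta_last_modified,
                      grace=timedelta(minutes=30)):
    """Return tables whose newest source file arrived but is not yet
    reflected in the Delta table (allowing a processing grace period).

    source_arrivals:     {table_name: datetime of newest file in ADLS}
    delta_last_modified: {table_name: datetime of last Delta commit}
    Both mappings are assumed to be built elsewhere (hypothetical inputs).
    """
    stale = []
    for table, arrived_at in source_arrivals.items():
        committed_at = delta_last_modified.get(table)
        # Stale if the table was never written, or its last commit
        # predates the newest source file beyond the grace period.
        if committed_at is None or committed_at + grace < arrived_at:
            stale.append(table)
    return sorted(stale)

# Example: "sales" received a file well after its last commit -> stale;
# "orders" was committed within the grace period -> fresh.
arrivals = {
    "sales":  datetime(2024, 6, 2, 10, 0),
    "orders": datetime(2024, 6, 1, 9, 0),
}
commits = {
    "sales":  datetime(2024, 6, 1, 12, 0),
    "orders": datetime(2024, 6, 1, 9, 15),
}
print(find_stale_tables(arrivals, commits))  # -> ['sales']
```

The grace period avoids flagging tables whose latest file is still inside the normal 30-minute processing window.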
3 weeks ago
You can try the cloud_files_state function. It provides a SQL API for inspecting the state of a stream, so you can find metadata about files that have been discovered by an Auto Loader stream:
SELECT * FROM cloud_files_state('path/to/checkpoint');
Configure Auto Loader for production workloads | Databricks on AWS
3 weeks ago
@szymon_dybczak My target table is not a stream but a regular Delta table. I got this error -
3 weeks ago
Weird; you're using Auto Loader, which uses Spark Structured Streaming under the hood, so it should work. Did you provide the correct path to the checkpoint location?
Anyway, tomorrow I'll try to run it in my environment. I have a similar setup at one of my clients (also Auto Loader with directory listing mode), and I'm quite sure this function worked 🙂 I'll keep you updated.
3 weeks ago
@szymon_dybczak Ah sorry, let me rephrase. I initially tried the command on the Delta table directly, which produced the error. Then I tried it on the checkpoint path. That did return results, though the discovery time was null for all rows. Still, this doesn't solve the problem for me: a new file arrived yesterday in cloud object storage (the source is raw Parquet, not Delta) for the Delta table in question, but cloud_files_state does not show that this file arrived or was discovered. So it seems that running this command against the checkpoint only tells us what was already processed? In my case, it appears Auto Loader never discovers the file at all.
an hour ago
At your scale (20k folders, 30-minute cadence), file notification mode for Auto Loader is the recommended approach. Directory listing will keep issuing huge LIST calls, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most large workloads.
Please check this documentation:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode
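For reference, here is a minimal sketch of the Auto Loader options involved in switching to file notification mode on an ADLS Gen2 source, assuming a service principal with permission to manage Event Grid subscriptions. All identifier values are placeholders; see the linked docs for the full list of required permissions.

```python
# Hypothetical Auto Loader option set for file notification mode on ADLS Gen2.
# All angle-bracket values are placeholders, not real identifiers.
autoloader_options = {
    "cloudFiles.format": "parquet",
    # Switch from directory listing to event-driven file notifications.
    "cloudFiles.useNotifications": "true",
    # Service principal used to set up the Azure Event Grid subscription
    # and storage queue (placeholder values).
    "cloudFiles.subscriptionId": "<azure-subscription-id>",
    "cloudFiles.tenantId": "<tenant-id>",
    "cloudFiles.clientId": "<client-id>",
    "cloudFiles.clientSecret": "<client-secret>",
    "cloudFiles.resourceGroup": "<resource-group>",
}

# On Databricks (not runnable outside a cluster), usage would look roughly like:
# (spark.readStream.format("cloudFiles")
#      .options(**autoloader_options)
#      .load("abfss://<container>@<account>.dfs.core.windows.net/<path>")
#      .writeStream
#      .option("checkpointLocation", "<checkpoint-path>")
#      .trigger(availableNow=True)
#      .table("<target_table>"))
```

With notifications, arrival of a file in any of the 20k folders enqueues an event, so the stream no longer has to re-list every folder each cycle.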