cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Figure out stale tables/folders being loaded by auto-loader

noorbasha534
Valued Contributor II

Hello all

We have a pipeline which uses auto-loader to load data from cloud object storage (ADLS) to a delta table. We use directory listing at the moment. And there exist around 20000 folders to be verified in ADLS every 30 mins to check for new data and process into a delta table.

we realize this approach does not process files of some tables (aka folders), resulting into stale tables in the lakehouse.

Is there a way to query the rocksdb to know that there arrived files for say 8000 tables for today (out of 20000), and then we profile on the delta side the last modified date of the table, compare both sides and figure out the stale tables..?

Or, is there another better & fool-proof approach...

1 ACCEPTED SOLUTION

Accepted Solutions

At your scale (20k folders, 30-min cadence), file notification mode for the autoloader is the recommended approach. Directory listing will keep doing huge LISTs, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most workloads. 

Please check this documentation:

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode

 

 

View solution in original post

6 REPLIES 6

szymon_dybczak
Esteemed Contributor III

Hi @noorbasha534 

You can try to use cloud_files_state function. It provide SQL API for inspecting state of a stream, so you can find metadata about files that have been discovered by an Auto Loader stream:

 

SELECT * FROM cloud_files_state('path/to/checkpoint');

Configure Auto Loader for production workloads | Databricks on AWS

noorbasha534
Valued Contributor II

@szymon_dybczak My target table is not a stream but regular delta table. I got this error - 

[STREAMING_TABLE_OPERATION_NOT_ALLOWED.NON_STREAMING_TABLE] The operation CLOUD_FILES_STATE is not allowed: `catalog_name`.`schema_name`.`table_name` is not a Streaming Table. SQLSTATE: 42601

 

szymon_dybczak
Esteemed Contributor III

Weird, you're using autoloader. Under the hood it uses spark structured streaming, so it should work. Did you provide correct path to the checkpoint location?

Anyway, tomorrow I'll try to run it on my environment. I have similar setup in one of my clients (also autoloader with directory listing mode) and I'm quite sure that this function worked 🙂 I'll keep you updated

noorbasha534
Valued Contributor II

@szymon_dybczak  ah sorry, let me rephrase. I tried the command initially on the delta table directly. That resulted the error. Then I tried on the check point. It did give me results though discovered on null for all the rows. Still, this does not solve the problem for me - there is a new file that arrived yesterday in cloud object storage (source is raw parquet, not delta) for the delta table (target is delta) at hand but the cloud_files_state does not tell me that this file arrived/discovered. So, it seems as we run this command on the target/delta side, it only tells what was processed?..In my case, seems auto loader does not discover..

At your scale (20k folders, 30-min cadence), file notification mode for the autoloader is the recommended approach. Directory listing will keep doing huge LISTs, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most workloads. 

Please check this documentation:

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode

 

 

BS_THE_ANALYST
Esteemed Contributor II

@Krishna_S I didn't know about file detection modes, that's very cool! 😊.

@noorbasha534 according to the documentation, there is a piece around RockDB: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/#how-does-auto-loader-... .. much to @szymon_dybczak's point above, the docs indicate that there should be a checkpoint location.

BS_THE_ANALYST_0-1759486968857.png

This is a really interesting post @noorbasha534. I'm certainly following/bookmarking this one 😊.

All the best,
BS

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now