3 weeks ago
Hello all
We have a pipeline that uses Auto Loader to load data from cloud object storage (ADLS) into a Delta table. We currently use directory listing mode, and around 20,000 folders in ADLS must be checked every 30 minutes for new data to process into the Delta table.
We have noticed that this approach fails to process files for some tables (i.e., folders), resulting in stale tables in the lakehouse.
Is there a way to query the RocksDB state to learn that files arrived today for, say, 8,000 of the 20,000 tables? We could then profile the last-modified date of each table on the Delta side, compare both sides, and identify the stale tables.
Or is there another, better and more fool-proof approach?
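The comparison step described above can be sketched in plain Python. This is a minimal sketch, assuming we can already extract two mappings built elsewhere: each table's newest source-file arrival time (e.g., from checkpoint state or a storage listing) and each Delta table's last commit time (e.g., from DESCRIBE HISTORY). The function name, inputs, and grace period are hypothetical.

```python
from datetime import datetime, timedelta

def find_stale_tables(source_arrivals, delta_last_modified,
                      grace=timedelta(minutes=30)):
    """Return tables whose newest source file arrived but is not yet
    reflected in the Delta table (allowing a processing grace period).

    source_arrivals:     {table_name: datetime of newest file in ADLS}
    delta_last_modified: {table_name: datetime of last Delta commit}
    Both mappings are assumed to be built elsewhere (hypothetical inputs).
    """
    stale = []
    for table, arrived_at in source_arrivals.items():
        committed_at = delta_last_modified.get(table)
        # Stale if the table was never written, or its last commit
        # predates the newest source file beyond the grace period.
        if committed_at is None or committed_at + grace < arrived_at:
            stale.append(table)
    return sorted(stale)

# Example: "sales" received a file well after its last commit -> stale;
# "orders" was committed within the grace period -> fresh.
arrivals = {
    "sales":  datetime(2024, 6, 2, 10, 0),
    "orders": datetime(2024, 6, 1, 9, 0),
}
commits = {
    "sales":  datetime(2024, 6, 1, 12, 0),
    "orders": datetime(2024, 6, 1, 9, 15),
}
print(find_stale_tables(arrivals, commits))  # -> ['sales']
```

The grace period avoids flagging tables whose latest file is still inside the normal 30-minute processing window.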
3 weeks ago
You can try the cloud_files_state function. It provides a SQL API for inspecting the state of a stream, so you can find metadata about files that have been discovered by an Auto Loader stream:
SELECT * FROM cloud_files_state('path/to/checkpoint');
Configure Auto Loader for production workloads | Databricks on AWS
3 weeks ago
@szymon_dybczak My target table is not a stream but a regular Delta table. I got this error -
3 weeks ago
Weird; you're using Auto Loader, which uses Spark Structured Streaming under the hood, so it should work. Did you provide the correct path to the checkpoint location?
Anyway, tomorrow I'll try to run it in my environment. I have a similar setup at one of my clients (also Auto Loader with directory listing mode), and I'm quite sure this function worked 🙂 I'll keep you updated.
3 weeks ago
@szymon_dybczak Ah sorry, let me rephrase. I initially tried the command on the Delta table directly, which produced the error. Then I tried it on the checkpoint path. That did return results, though the discovery time was null for all rows. Still, this doesn't solve the problem for me: a new file arrived yesterday in cloud object storage (the source is raw Parquet, not Delta) for the Delta table in question, but cloud_files_state does not show that this file arrived or was discovered. So it seems that running this command against the checkpoint only tells us what was already processed? In my case, it appears Auto Loader never discovers the file at all.
an hour ago
At your scale (20k folders, 30-minute cadence), file notification mode for Auto Loader is the recommended approach. Directory listing will keep issuing huge LIST calls, get throttled, and occasionally miss windows; notifications scale better and are cheaper. Databricks explicitly recommends migrating to file notifications for most large workloads.
Please check this documentation:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode
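For reference, here is a minimal sketch of the Auto Loader options involved in switching to file notification mode on an ADLS Gen2 source, assuming a service principal with permission to manage Event Grid subscriptions. All identifier values are placeholders; see the linked docs for the full list of required permissions.

```python
# Hypothetical Auto Loader option set for file notification mode on ADLS Gen2.
# All angle-bracket values are placeholders, not real identifiers.
autoloader_options = {
    "cloudFiles.format": "parquet",
    # Switch from directory listing to event-driven file notifications.
    "cloudFiles.useNotifications": "true",
    # Service principal used to set up the Azure Event Grid subscription
    # and storage queue (placeholder values).
    "cloudFiles.subscriptionId": "<azure-subscription-id>",
    "cloudFiles.tenantId": "<tenant-id>",
    "cloudFiles.clientId": "<client-id>",
    "cloudFiles.clientSecret": "<client-secret>",
    "cloudFiles.resourceGroup": "<resource-group>",
}

# On Databricks (not runnable outside a cluster), usage would look roughly like:
# (spark.readStream.format("cloudFiles")
#      .options(**autoloader_options)
#      .load("abfss://<container>@<account>.dfs.core.windows.net/<path>")
#      .writeStream
#      .option("checkpointLocation", "<checkpoint-path>")
#      .trigger(availableNow=True)
#      .table("<target_table>"))
```

With notifications, arrival of a file in any of the 20k folders enqueues an event, so the stream no longer has to re-list every folder each cycle.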