Hello all
We have a pipeline that uses Auto Loader to load data from cloud object storage (ADLS) into Delta tables, currently in directory listing mode. Around 20,000 folders in ADLS have to be scanned every 30 minutes to check for new data and process it into the corresponding Delta tables.
We've realized this approach sometimes fails to process files for some tables (i.e. folders), resulting in stale tables in the lakehouse.
Is there a way to query the RocksDB state to find out which tables received files today (say 8,000 out of the 20,000)? We could then profile the last-modified timestamp of each table on the Delta side, compare the two, and identify the stale tables.
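For context on what I mean: Databricks exposes Auto Loader's discovery state (what's in the RocksDB checkpoint) via the `cloud_files_state` SQL function, e.g. `SELECT * FROM cloud_files_state('<checkpoint-path>')`, which returns one row per discovered file with timestamps such as its discovery time; and `DESCRIBE DETAIL <table>` gives the Delta table's `lastModified`. The comparison step I have in mind could be sketched roughly like this in plain Python (the table names, timestamps, and dict shapes below are purely illustrative, not real output from either API):

```python
from datetime import datetime

def find_stale_tables(latest_arrival, last_modified):
    """Return tables whose newest discovered file postdates the table's
    last modification, i.e. data arrived but was never processed.

    latest_arrival: {table: newest discovery time, e.g. from cloud_files_state}
    last_modified:  {table: lastModified, e.g. from DESCRIBE DETAIL}
    """
    stale = []
    for table, arrived in latest_arrival.items():
        processed = last_modified.get(table)
        # No Delta-side record, or files discovered after the last write -> stale.
        if processed is None or arrived > processed:
            stale.append(table)
    return sorted(stale)

# Illustrative values only -- real ones would come from the checkpoint
# state and the Delta transaction log.
arrivals = {
    "orders":    datetime(2024, 5, 1, 10, 0),
    "customers": datetime(2024, 5, 1, 9, 30),
}
modified = {
    "orders":    datetime(2024, 5, 1, 10, 5),  # written after arrival -> fresh
    "customers": datetime(2024, 5, 1, 9, 0),   # arrival never processed -> stale
}
print(find_stale_tables(arrivals, modified))  # -> ['customers']
```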
Or is there another, more fool-proof approach?