Hi All,
I have a situation where I'm receiving various CSV files in a storage location.
The issue I'm facing is that I'm using Databricks Autoloader, but some files might arrive later than expected. In this case, we need to notify the relevant team about the delayed files.
I'm thinking about how to design a solution for this. One idea is to create a control table that holds the scheduled arrival details for each file. Once Autoloader processes a file, we can use the file metadata to retrieve the file name and the time it arrived in the storage location, compare that against the control table, and report to the appropriate team if there are any delays.

From what I understand about Databricks Autoloader, a job still needs to be scheduled to pick up files from the storage location. Instead of using a classic cluster for that schedule, we could use serverless compute to save on costs.
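To make the control-table idea concrete, here is a minimal sketch of the delay-check step only, in plain Python rather than PySpark. The file names, deadlines, and the `find_delayed_files` helper are all hypothetical; in practice the schedule would come from the control table and the actual arrival times from Autoloader's file metadata (e.g. the `_metadata.file_modification_time` column).

```python
from datetime import datetime, timedelta

# Hypothetical control-table rows: file name -> expected arrival deadline.
# In a real pipeline these would be read from the Delta control table.
control_table = {
    "sales.csv": datetime(2024, 1, 15, 6, 0),      # expected by 06:00
    "inventory.csv": datetime(2024, 1, 15, 7, 0),  # expected by 07:00
}

def find_delayed_files(arrivals, schedule, grace=timedelta(minutes=15)):
    """Return files that arrived after their deadline (plus grace) or not at all.

    `arrivals` maps file name -> actual arrival time (e.g. taken from
    Autoloader's file metadata); a missing key means 'not yet arrived'.
    """
    delayed = {}
    for name, deadline in schedule.items():
        arrived = arrivals.get(name)
        if arrived is None or arrived > deadline + grace:
            delayed[name] = arrived  # None means the file is still missing
    return delayed

# Example: sales.csv arrived late, inventory.csv never showed up.
arrivals = {"sales.csv": datetime(2024, 1, 15, 6, 45)}
delayed = find_delayed_files(arrivals, control_table)
# delayed now lists both files, so this is where the notification would fire.
```

The `delayed` dict is what you would hand to the notification step (an alert job, email task, etc.).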
Apart from Autoloader, I am considering the following options/solutions:
1) How can we achieve this scenario using Databricks' built-in features?
2) We're also considering an ADF event-based approach.
3) Databricks Workflows has a file arrival trigger, and we can run the job on serverless compute, but the customer currently uses ADF for scheduling instead of Databricks Workflows.
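For option 3, the file arrival trigger can be set in the job definition via the Databricks Jobs API. This is a sketch of just the trigger portion of a job spec; the job name, storage URL, and timing values are placeholders, not taken from the scenario above.

```json
{
  "name": "csv-ingest-autoloader",
  "trigger": {
    "pause_status": "UNPAUSED",
    "file_arrival": {
      "url": "abfss://landing@<storage-account>.dfs.core.windows.net/csv/",
      "min_time_between_triggers_seconds": 60
    }
  }
}
```

With this in place the job runs only when new files land, so you avoid paying for a continuously scheduled cluster; the delay check against the control table would still run inside the job itself.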
If you have any inputs on this, kindly let me know.
Regards,
Phani