Hi All,
I have a situation where I'm receiving various CSV files in a storage location.
The issue I'm facing is that I'm using Databricks Autoloader, but some files might arrive later than expected. In this case, we need to notify the relevant team about the delayed files.
I'm thinking about how to design a solution for this. One idea is to create a control table that holds the scheduled arrival details for each file. Once Autoloader processes a file, we can use the file metadata to retrieve the file name and the time it arrived in the storage location, compare that against the control table, and report to the appropriate team if there are any delays.

From what I understand about Databricks Autoloader, a job still needs to be scheduled to pick up files from the storage location. Instead of using a classic cluster for that schedule, we could use serverless compute to save on costs.
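To make the control-table idea concrete, here is a minimal sketch of the delay-check step only, in plain Python rather than PySpark. The file names, deadlines, and the `find_delayed_files` helper are all hypothetical; in practice the schedule would come from the control table and the actual arrival times from Autoloader's file metadata (e.g. the `_metadata.file_modification_time` column).

```python
from datetime import datetime, timedelta

# Hypothetical control-table rows: file name -> expected arrival deadline.
# In a real pipeline these would be read from the Delta control table.
control_table = {
    "sales.csv": datetime(2024, 1, 15, 6, 0),      # expected by 06:00
    "inventory.csv": datetime(2024, 1, 15, 7, 0),  # expected by 07:00
}

def find_delayed_files(arrivals, schedule, grace=timedelta(minutes=15)):
    """Return files that arrived after their deadline (plus grace) or not at all.

    `arrivals` maps file name -> actual arrival time (e.g. taken from
    Autoloader's file metadata); a missing key means 'not yet arrived'.
    """
    delayed = {}
    for name, deadline in schedule.items():
        arrived = arrivals.get(name)
        if arrived is None or arrived > deadline + grace:
            delayed[name] = arrived  # None means the file is still missing
    return delayed

# Example: sales.csv arrived late, inventory.csv never showed up.
arrivals = {"sales.csv": datetime(2024, 1, 15, 6, 45)}
delayed = find_delayed_files(arrivals, control_table)
# delayed now lists both files, so this is where the notification would fire.
```

The `delayed` dict is what you would hand to the notification step (an alert job, email task, etc.).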
Apart from Autoloader, I am considering the following options/solutions:
1) How can we achieve this scenario using Databricks' built-in features?
2) We're also considering an ADF event-based approach.
3) Databricks Workflows has a file arrival trigger, and we can run the job on serverless compute, but the customer currently uses ADF for scheduling instead of Databricks Workflows.
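For option 3, the file arrival trigger can be set in the job definition via the Databricks Jobs API. This is a sketch of just the trigger portion of a job spec; the job name, storage URL, and timing values are placeholders, not taken from the scenario above.

```json
{
  "name": "csv-ingest-autoloader",
  "trigger": {
    "pause_status": "UNPAUSED",
    "file_arrival": {
      "url": "abfss://landing@<storage-account>.dfs.core.windows.net/csv/",
      "min_time_between_triggers_seconds": 60
    }
  }
}
```

With this in place the job runs only when new files land, so you avoid paying for a continuously scheduled cluster; the delay check against the control table would still run inside the job itself.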
If you have any inputs on this, kindly let me know.
Regards,
Phani