
Late file arrivals - Autoloader

Phani1
Valued Contributor II

 

Hi All,

I have a situation where I'm receiving various CSV files in a storage location.

The issue I'm facing is that I'm using Databricks Autoloader, but some files might arrive later than expected. In this case, we need to notify the relevant team about the delayed files.

I'm thinking about how to design a solution for this. One idea is to create a control table that holds the expected schedule for each file. Once the Autoloader processes a file, we can use the file metadata to get the file information and the time it arrived in the storage location. We can then compare this against the control table and report any delays to the appropriate team.

From what I understand about Databricks Autoloader, the job needs to be scheduled to pick up files from the storage location. Instead of using a classic cluster for scheduling, we could use serverless options to save on costs.
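To make the control-table comparison concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `ScheduledFile` fields, the `find_delays` helper, and the shape of the `arrivals` mapping are hypothetical, not part of any Databricks API. In a real pipeline the arrival times would probably come from the file metadata that Autoloader exposes (for example the file modification time), which is only an approximation of the true arrival time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScheduledFile:
    """One row of the hypothetical control table."""
    file_name: str
    expected_by: datetime  # deadline by which the file should have arrived
    owner_team: str        # team to notify if the file is delayed


def find_delays(schedule, arrivals, now):
    """Compare the schedule with observed arrivals.

    schedule: list of ScheduledFile rows (the control table).
    arrivals: dict mapping file_name -> arrival datetime for ingested files.
    now:      current time, used to detect files that never arrived.

    Returns a list of (file_name, owner_team, status, delay) tuples, where
    status is "late" (arrived after the deadline) or "missing" (deadline
    passed and the file has not arrived yet).
    """
    delays = []
    for entry in schedule:
        arrived_at = arrivals.get(entry.file_name)
        if arrived_at is None:
            # Not ingested yet: only a problem once the deadline has passed.
            if now > entry.expected_by:
                delays.append((entry.file_name, entry.owner_team,
                               "missing", now - entry.expected_by))
        elif arrived_at > entry.expected_by:
            delays.append((entry.file_name, entry.owner_team,
                           "late", arrived_at - entry.expected_by))
    return delays
```

Checking for "missing" files as well as "late" ones matters here: a comparison that only runs when a file is ingested would never flag a file that does not show up at all, so the check probably needs to run on its own schedule too.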

Apart from Autoloader, I am also considering the following questions and options:

1) How can we achieve this scenario using Databricks' built-in features?
2) We're also considering an ADF event-based approach.
3) Databricks Workflows has a file arrival trigger, and we could schedule jobs on serverless compute, but the customer currently uses ADF for scheduling instead of Databricks Workflows.

Please let me know if you have any input on this.

Regards,

Phani

 

1 REPLY

HaggMan
New Contributor III

Well, Autoloader could work nicely with file notification events for arriving files. You could define a window duration for your "on-time" arrivals and use that as the baseline check. As files arrive they are assigned to their window, and when the file metadata is ingested with Autoloader, you compare the arrival time to the end of the window plus a late watermark. You could then trigger alerts for files that come in past the watermark.

I guess some of this depends on whether the on-time vs. late distinction follows a regular schedule.
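As a rough illustration of the window-plus-watermark check, here is a small plain-Python sketch. The function name, the three status labels, and the example durations are all hypothetical choices, not Autoloader features; the arrival time is assumed to come from whatever file metadata gets ingested alongside the data.

```python
from datetime import datetime, timedelta


def classify_arrival(arrival, window_start, window_duration, late_watermark):
    """Classify a file arrival against its expected window.

    on_time               -> arrived by the end of the window
    late_within_watermark -> arrived after the window but inside the grace period
    late_alert            -> arrived past window end + watermark; should alert
    """
    window_end = window_start + window_duration
    if arrival <= window_end:
        return "on_time"
    if arrival <= window_end + late_watermark:
        return "late_within_watermark"
    return "late_alert"


# Example: files are expected between 06:00 and 07:00, with 30 minutes of grace.
status = classify_arrival(
    arrival=datetime(2024, 1, 1, 7, 45),
    window_start=datetime(2024, 1, 1, 6, 0),
    window_duration=timedelta(hours=1),
    late_watermark=timedelta(minutes=30),
)
# status is "late_alert" because 07:45 is past 07:30 (window end + watermark)
```

Keeping a separate "late within watermark" status is a design choice: it lets you log mildly late files without paging anyone, and only alert on arrivals that exceed the grace period.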

 
