Hubert-Dudek
Esteemed Contributor III

Databricks now supports event-driven workloads, especially for loading cloud files from external locations. This means you can save costs and resources by triggering your Databricks jobs only when new files arrive in your cloud storage, instead of mounting it to DBFS and polling it periodically. To use this feature, you need to follow these steps:

  • Add an external location for your ADLS2 container,
  • Make sure the storage credentials you use (such as Access Connector, service principal, or managed identity) have Storage Blob Data Contributor permissions for that container,
  • Make sure the account you use to run your workload has at least the READ FILES permission on the external location,
  • Write a notebook that loads cloud files from the external location (a minimal Auto Loader sketch follows after this list),
  • Set a file arrival trigger for your workflow and specify the exact external location as the source (a second sketch below shows this via the SDK).

With these steps, you can easily create and run event-driven workloads on Databricks.
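
For the notebook step, here is a minimal sketch using Auto Loader with an availableNow trigger, which processes whatever has arrived and then stops (a good fit for a file-arrival-triggered job). All paths, the file format, and the target table are placeholders for your own external location:

# Minimal Auto Loader sketch; "spark" is the session a Databricks notebook provides.
# Every path and the target table below are placeholders.
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                      # format of the arriving files
    .option("cloudFiles.schemaLocation", f"{base}/_schemas")  # schema tracking for Auto Loader
    .load(f"{base}/landing/")                                 # the watched external location
)

(
    df.writeStream
    .option("checkpointLocation", f"{base}/_checkpoints")
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("main.bronze.landing_events")
)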
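
For the trigger step, the file arrival trigger can be set in the workflow UI or programmatically. Below is a sketch using the Databricks Python SDK; the class names (TriggerSettings, FileArrivalTriggerConfiguration) reflect the SDK's jobs service as I understand it, and the notebook path, cluster ID, and storage URL are placeholders, so verify against your SDK version:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Placeholder notebook path, cluster ID, and storage URL.
w.jobs.create(
    name="file-arrival-ingest",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Users/me/ingest"),
            existing_cluster_id="1234-567890-abcdefgh",
        )
    ],
    # The trigger watches the exact URL of the external location.
    trigger=jobs.TriggerSettings(
        file_arrival=jobs.FileArrivalTriggerConfiguration(
            url="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/landing/"
        )
    ),
)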


10 REPLIES

Kaniz
Community Manager

Hi @Hubert Dudek, we are truly grateful for the informative content you've shared with our community.

Your dedication to providing valuable insights has not gone unnoticed, and it has significantly enriched all members’ discussions and learning experiences.

Please accept our heartfelt appreciation for your time and effort. Keep up the fantastic work, and we look forward to your future contributions!

Salesforce
New Contributor II

Hey,

We have a use case where Salesforce generates Change Data Capture (CDC) platform events. With this new event-driven workload support, can Databricks directly consume these CDC events from Salesforce?

We are also currently evaluating middleware like Mulesoft, as directed in this reference article: Subscribe to Change Data Capture Events with the Salesforce Connector. However, we are concerned about the pricing of Mulesoft.

-werners-
Esteemed Contributor III

I think we are talking about file events here.
What you are describing is in fact streaming ingest from a CDC system. That can be done, but not by connecting directly to the CDC source. You can forward the CDC events to an event queue like Kafka and let Spark subscribe to one of those topics.
Mulesoft probably works too, but honestly, as you already mentioned, it is overpriced.
What is presented here was already possible in many other systems, but is now also available in Databricks.
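
For the Spark side of that, a minimal sketch, assuming the CDC events are forwarded to a Kafka topic (the broker address and topic name are placeholders):

# Subscribe to the topic that receives the forwarded Salesforce CDC events.
# "broker:9092" and "salesforce-cdc" are placeholder values.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "salesforce-cdc")
    .load()
)

# Kafka delivers the payload as binary; cast it to a string before
# parsing the CDC JSON downstream.
events = df.selectExpr("CAST(value AS STRING) AS payload")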

Floody
New Contributor II

While this works great with new files, is it possible to trigger when an update happens to an existing file?

-werners-
Esteemed Contributor III

The trigger fires on file events in blob storage, where objects are typically immutable: files cannot be updated in place, only created, deleted, or overwritten.

Floody
New Contributor II

Yes, the file is getting overwritten, but the trigger is not firing. Maybe I am missing something?

-werners-
Esteemed Contributor III

Probably the event is not triggered by an overwrite. Can you test with a delete followed by a create?

adriennn
Contributor

For reference, the trigger will not contain any information about the event itself (like file names, etc.), so you cannot build a dynamic event-driven architecture with this trigger.

daniel_sahal
Esteemed Contributor

@adriennn 
That's because it's only one of the trigger types. To load newly arrived files automatically, you can utilize AutoLoader.

adriennn
Contributor

@daniel_sahal I get your point, but if for a scheduled trigger you can get all kinds of attributes about the trigger time (arguably, this is available for all the triggers), then why wouldn't the most important attribute of a file event be available through the trigger?

What I'm thinking is something like:
job.trigger.file_arrival.file_path, job.trigger.file_arrival.parent_folder, etc.

