Hubert-Dudek
Esteemed Contributor III

Databricks now supports event-driven workloads, especially for loading cloud files from external locations. This means you can save costs and resources by triggering your Databricks jobs only when new files arrive in your cloud storage, instead of mounting it to DBFS and polling it periodically. To use this feature, you need to follow these steps:

  • Add an external location for your ADLS2 container,
  • Make sure the storage credentials you use (such as Access Connector, service principal, or managed identity) have Storage Blob Data Contributor permissions for that container,
  • Make sure the account you use to run your workload has at least the READ FILES permission on the external location,
  • Write a notebook that loads cloud files from the external location (a minimal Auto Loader sketch follows after this list),
  • Set a file arrival trigger for your workflow and specify the exact external location as the source (a second sketch below shows this via the SDK).

With these steps, you can easily create and run event-driven workloads on Databricks.
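
For the notebook step, here is a minimal sketch using Auto Loader with an availableNow trigger, which processes whatever has arrived and then stops (a good fit for a file-arrival-triggered job). All paths, the file format, and the target table are placeholders for your own external location:

# Minimal Auto Loader sketch; "spark" is the session a Databricks notebook provides.
# Every path and the target table below are placeholders.
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                      # format of the arriving files
    .option("cloudFiles.schemaLocation", f"{base}/_schemas")  # schema tracking for Auto Loader
    .load(f"{base}/landing/")                                 # the watched external location
)

(
    df.writeStream
    .option("checkpointLocation", f"{base}/_checkpoints")
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("main.bronze.landing_events")
)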
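
For the trigger step, the file arrival trigger can be set in the workflow UI or programmatically. Below is a sketch using the Databricks Python SDK; the class names (TriggerSettings, FileArrivalTriggerConfiguration) reflect the SDK's jobs service as I understand it, and the notebook path, cluster ID, and storage URL are placeholders, so verify against your SDK version:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Placeholder notebook path, cluster ID, and storage URL.
w.jobs.create(
    name="file-arrival-ingest",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Users/me/ingest"),
            existing_cluster_id="1234-567890-abcdefgh",
        )
    ],
    # The trigger watches the exact URL of the external location.
    trigger=jobs.TriggerSettings(
        file_arrival=jobs.FileArrivalTriggerConfiguration(
            url="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/landing/"
        )
    ),
)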


10 REPLIES

Kaniz
Community Manager

Hi @Hubert Dudek, we are truly grateful for the informative content you've shared with our community.

Your dedication to providing valuable insights has not gone unnoticed, and it has significantly enriched all members’ discussions and learning experiences.

Please accept our heartfelt appreciation for your time and effort. Keep up the fantastic work, and we look forward to your future contributions!

Salesforce
New Contributor II

Hey,

We have a use case where Salesforce generates Change Data Capture (CDC) platform events. With this new event-driven workload support, can Databricks directly consume these CDC events from Salesforce?

We are also currently evaluating middleware like Mulesoft, as directed in this reference article: Subscribe to Change Data Capture Events with the Salesforce Connector. However, we are concerned about the pricing of Mulesoft.

-werners-
Esteemed Contributor III

I think we are talking about file events here.
What you are describing is in fact streaming ingest from a CDC system. That can be done, but not by connecting directly to the CDC source. You can forward the CDC events to an event queue like Kafka and let Spark subscribe to one of those topics.
Mulesoft probably works too, but honestly, as you already mentioned, it is overpriced.
What is presented here was already possible in many other systems, but is now also available in Databricks.
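
For the Spark side of that, a minimal sketch, assuming the CDC events are forwarded to a Kafka topic (the broker address and topic name are placeholders):

# Subscribe to the topic that receives the forwarded Salesforce CDC events.
# "broker:9092" and "salesforce-cdc" are placeholder values.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "salesforce-cdc")
    .load()
)

# Kafka delivers the payload as binary; cast it to a string before
# parsing the CDC JSON downstream.
events = df.selectExpr("CAST(value AS STRING) AS payload")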

Floody
New Contributor II

While this works great with new files, is it possible to trigger when an update happens to an existing file?

-werners-
Esteemed Contributor III

The trigger fires on file events in blob storage, where objects are typically immutable: files cannot be updated in place, only created, deleted, or overwritten.

Floody
New Contributor II

Yes, the file is getting overwritten, but the trigger is not firing. Maybe I am missing something?

-werners-
Esteemed Contributor III

Probably the event is not triggered by an overwrite. Can you test with a delete followed by a create?

adriennn
Contributor

For reference, the trigger will not contain any information about the event itself (like file names, etc.), so you cannot build a dynamic event-driven architecture with this trigger.

daniel_sahal
Esteemed Contributor

@adriennn 
That's because it's only one of the trigger types. To load newly arrived files automatically, you can utilize AutoLoader.

adriennn
Contributor

@daniel_sahal I get your point, but if for a scheduled trigger you can get all kinds of attributes about the trigger time (arguably, this is available for all the triggers), then why wouldn't the most important attribute of a file event be available through the trigger?

What I'm thinking is something like:
job.trigger.file_arrival.file_path, job.trigger.file_arrival.parent_folder, etc.

