<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Shared job clusters on Azure Data Factory ADF in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7316"&gt;@KrzysztofPrzyso&lt;/a&gt;&amp;nbsp;Thanks for sharing your concern here.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The Shared Jobs Cluster feature in Databricks is specifically designed for tasks within the same job run and is not intended to be shared across different jobs or across runs of the same job. It optimizes resource usage within a single job run by allowing multiple tasks in that run to reuse the cluster. As such, it is not feasible to use the Shared Jobs Cluster feature from an external orchestrator like Azure Data Factory (ADF) or Synapse Workspace to reduce startup time and compute cost, or to reuse/cache data across different job runs.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;However, if you want to save startup time,&amp;nbsp;&lt;/SPAN&gt;reduce compute cost for the underlying VM, and possibly reuse/cache some data from Azure Data Factory, you can select an existing interactive cluster or an existing instance pool when creating the Databricks linked service. That way, if you run multiple tasks/jobs in sequence, each subsequent task/job in the run will reuse the same cluster.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Please give a like if this was helpful. Follow-ups are appreciated.&lt;/P&gt;
&lt;P&gt;Kudos,&lt;/P&gt;
&lt;P&gt;Sai Kumar&lt;/P&gt;</description>
    <pubDate>Tue, 13 Feb 2024 12:47:58 GMT</pubDate>
    <dc:creator>saikumar246</dc:creator>
    <dc:date>2024-02-13T12:47:58Z</dc:date>
    <item>
      <title>Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/59971#M31552</link>
      <description>&lt;P&gt;Hi Databricks Community,&lt;/P&gt;&lt;P&gt;If at all possible, I would like to use a Shared Jobs Cluster from an external orchestrator such as Azure Data Factory (ADF) or Synapse Workspace.&lt;BR /&gt;The main reasons for using a Shared Job cluster are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;reduction of start-up time (&amp;lt;1 min vs 5 min per activity)&lt;/LI&gt;&lt;LI&gt;reduction of compute cost for the underlying VM&lt;/LI&gt;&lt;LI&gt;possibly reusing / caching some data&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In other words, if we have multiple Databricks activities run in sequence on the same data (a common practice in the medallion architecture), we would like to avoid treating each of them as a completely isolated run.&lt;BR /&gt;This is possible in Databricks Workflows:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/use-compute#use-shared-job-clusters" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/use-compute#use-shared-job-clusters&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://www.databricks.com/blog/2022/02/04/saving-time-and-costs-with-cluster-reuse-in-databricks-jobs.html?utm_source=microsoft&amp;amp;utm_medium=partner&amp;amp;utm_campaign=7013f000000LjssAAC" target="_blank" rel="noopener"&gt;How to Save Time and Costs With Cluster Reuse in Databricks Jobs - The Databricks Blog&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Is it possible to use this feature from an external orchestrator like ADF?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I would like to avoid building custom synchronisation in which workflows are triggered and their status checked via the REST API, as described here:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://techcommunity.microsoft.com/t5/analytics-on-azure-blog/leverage-azure-databricks-jobs-orchestration-from-azure-data/ba-p/3123862" target="_blank" rel="noopener"&gt;Leverage Azure Databricks jobs orchestration from Azure Data Factory - Microsoft Community Hub&lt;/A&gt;&lt;BR /&gt;or here:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://medium.com/@ivangomezarnedo/how-to-orchestrate-databricks-jobs-from-azure-data-factory-using-databricks-rest-api-4d5e8c577581" target="_blank" rel="noopener"&gt;How to orchestrate Databricks jobs from Azure Data Factory using Databricks REST API | Medium&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In my view, the native Databricks ADF connector is almost always the best option. Please also consider the fact that, due to other requirements, I am not able to use workflows directly.&lt;BR /&gt;I would imagine that by supplying a common attribute, like 'pipeline().RunId', plus a 'WaitForNext' flag, one could reuse an existing cluster.&lt;/P&gt;</description>
      <pubDate>Mon, 12 Feb 2024 17:49:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/59971#M31552</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-02-12T17:49:22Z</dc:date>
    </item>
    <item>
      <title>Re: Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7316"&gt;@KrzysztofPrzyso&lt;/a&gt;&amp;nbsp;Thanks for sharing your concern here.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The Shared Jobs Cluster feature in Databricks is specifically designed for tasks within the same job run and is not intended to be shared across different jobs or across runs of the same job. It optimizes resource usage within a single job run by allowing multiple tasks in that run to reuse the cluster. As such, it is not feasible to use the Shared Jobs Cluster feature from an external orchestrator like Azure Data Factory (ADF) or Synapse Workspace to reduce startup time and compute cost, or to reuse/cache data across different job runs.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;However, if you want to save startup time,&amp;nbsp;&lt;/SPAN&gt;reduce compute cost for the underlying VM, and possibly reuse/cache some data from Azure Data Factory, you can select an existing interactive cluster or an existing instance pool when creating the Databricks linked service. That way, if you run multiple tasks/jobs in sequence, each subsequent task/job in the run will reuse the same cluster.&lt;/P&gt;
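As a rough sketch of what that looks like in practice (the property names follow the ADF "AzureDatabricks" linked-service schema; the workspace URL, Key Vault reference, cluster ID, and pool ID below are all placeholders):

```python
import json

# Hedged sketch of an ADF "AzureDatabricks" linked service payload.
# Instead of spinning up a new job cluster per activity, it pins an
# existing interactive cluster (or, alternatively, an instance pool)
# so that sequential Databricks activities reuse the same warm compute.
linked_service = {
    "name": "AzureDatabricksReuse",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1111111111111111.1.azuredatabricks.net",
            "accessToken": {
                "type": "AzureKeyVaultSecret",
                "secretName": "databricks-pat",  # placeholder secret name
                "store": {
                    "referenceName": "MyKeyVault",  # placeholder Key Vault
                    "type": "LinkedServiceReference",
                },
            },
            # Option 1: reuse a running interactive (all-purpose) cluster.
            "existingClusterId": "0213-123456-abcdefg1",
            # Option 2 (use instead of existingClusterId): draw VMs from an
            # instance pool so new job clusters skip VM provisioning time:
            # "instancePoolId": "0213-123456-pool-abcdefg1",
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

With existingClusterId set, every Databricks activity that references this linked service runs on the same cluster, so only the first activity in a sequence pays the startup cost.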
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Please give a like if this was helpful. Follow-ups are appreciated.&lt;/P&gt;
&lt;P&gt;Kudos,&lt;/P&gt;
&lt;P&gt;Sai Kumar&lt;/P&gt;</description>
      <pubDate>Tue, 13 Feb 2024 12:47:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</guid>
      <dc:creator>saikumar246</dc:creator>
      <dc:date>2024-02-13T12:47:58Z</dc:date>
    </item>
    <item>
      <title>Re: Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60305#M31630</link>
      <description>&lt;P&gt;Hi&amp;nbsp;Sai Kumar,&lt;/P&gt;&lt;P&gt;Many thanks for your response.&lt;/P&gt;&lt;P&gt;Unfortunately, using analytical clusters is not really an option for me due to the cost difference between job clusters and analytical clusters.&lt;BR /&gt;Job clusters also offer assurance that the latest deployed version of the code (wheel) file is being picked up.&lt;/P&gt;&lt;P&gt;If shared job clusters are not available, could you share some more details about cluster pools and ways to keep the VMs up?&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/pools" target="_blank"&gt;Create a pool - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;BR /&gt;It would be interesting for me to know&amp;nbsp;the&amp;nbsp;best practices for VMs in pools and any other ways to speed up the startup.&lt;BR /&gt;Are there any plans to introduce serverless Python clusters similar to serverless SQL warehouses?&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 11:36:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60305#M31630</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-02-15T11:36:00Z</dc:date>
    </item>
  </channel>
</rss>

