<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Shared job clusters on Azure Data Factory ADF in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7316"&gt;@KrzysztofPrzyso&lt;/a&gt;&amp;nbsp;Thanks for sharing your concern here.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The Shared Jobs Cluster feature in Databricks is specifically designed for tasks within the same job run and is not intended to be shared across different jobs or across runs of the same job. It optimizes resource usage within a single job run by allowing multiple tasks in that run to reuse the cluster. As such, it is not feasible to use the Shared Jobs Cluster feature from an external orchestrator like Azure Data Factory (ADF) or Synapse Workspace to reduce startup time and compute cost, or to reuse/cache data across different job runs.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;However, if you want to save startup time,&amp;nbsp;&lt;/SPAN&gt;reduce compute cost for the underlying VM, and possibly reuse/cache some data from Azure Data Factory, you can select an existing interactive cluster or an existing instance pool when creating the Databricks linked service. That way, if you run multiple tasks/jobs in sequence, each subsequent task/job in the run will reuse the same cluster.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Please give a like if this was helpful. Follow-ups are appreciated.&lt;/P&gt;
&lt;P&gt;Kudos,&lt;/P&gt;
&lt;P&gt;Sai Kumar&lt;/P&gt;</description>
    <pubDate>Tue, 13 Feb 2024 12:47:58 GMT</pubDate>
    <dc:creator>saikumar246</dc:creator>
    <dc:date>2024-02-13T12:47:58Z</dc:date>
    <item>
      <title>Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/59971#M31552</link>
      <description>&lt;P&gt;Hi Databricks Community,&lt;/P&gt;&lt;P&gt;If at all possible, I would like to use a Shared Jobs Cluster from an external orchestrator such as Azure Data Factory (ADF) or Synapse Workspace.&lt;BR /&gt;The main reasons for using a Shared Job cluster are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;reduction of start-up time (&amp;lt;1 min vs 5 min per activity)&lt;/LI&gt;&lt;LI&gt;reduction of compute cost for the underlying VM&lt;/LI&gt;&lt;LI&gt;possibly reusing / caching some data&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In other words, if we have multiple Databricks activities run in sequence on the same data (a common practice in the medallion architecture), we would like to avoid treating each of them as a completely isolated run.&lt;BR /&gt;This is possible in Databricks Workflows:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/use-compute#use-shared-job-clusters" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/use-compute#use-shared-job-clusters&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://www.databricks.com/blog/2022/02/04/saving-time-and-costs-with-cluster-reuse-in-databricks-jobs.html?utm_source=microsoft&amp;amp;utm_medium=partner&amp;amp;utm_campaign=7013f000000LjssAAC" target="_blank" rel="noopener"&gt;How to Save Time and Costs With Cluster Reuse in Databricks Jobs - The Databricks Blog&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Is it possible to use this feature from an external orchestrator like ADF?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I would like to avoid building custom synchronisation in which workflows are triggered and their status checked via the REST API, as described here:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://techcommunity.microsoft.com/t5/analytics-on-azure-blog/leverage-azure-databricks-jobs-orchestration-from-azure-data/ba-p/3123862" target="_blank" rel="noopener"&gt;Leverage Azure Databricks jobs orchestration from Azure Data Factory - Microsoft Community Hub&lt;/A&gt;&lt;BR /&gt;or here:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://medium.com/@ivangomezarnedo/how-to-orchestrate-databricks-jobs-from-azure-data-factory-using-databricks-rest-api-4d5e8c577581" target="_blank" rel="noopener"&gt;How to orchestrate Databricks jobs from Azure Data Factory using Databricks REST API | Medium&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In my view, the native Databricks ADF connector is almost always the best option. Please also consider the fact that, due to other requirements, I am not able to use workflows directly.&lt;BR /&gt;I would imagine that by supplying a common attribute, like 'pipeline().RunId', plus a 'WaitForNext' flag, one could reuse an existing cluster.&lt;/P&gt;</description>
      <pubDate>Mon, 12 Feb 2024 17:49:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/59971#M31552</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-02-12T17:49:22Z</dc:date>
    </item>
    <item>
      <title>Re: Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7316"&gt;@KrzysztofPrzyso&lt;/a&gt;&amp;nbsp;Thanks for sharing your concern here.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The Shared Jobs Cluster feature in Databricks is specifically designed for tasks within the same job run and is not intended to be shared across different jobs or across runs of the same job. It optimizes resource usage within a single job run by allowing multiple tasks in that run to reuse the cluster. As such, it is not feasible to use the Shared Jobs Cluster feature from an external orchestrator like Azure Data Factory (ADF) or Synapse Workspace to reduce startup time and compute cost, or to reuse/cache data across different job runs.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;However, if you want to save startup time,&amp;nbsp;&lt;/SPAN&gt;reduce compute cost for the underlying VM, and possibly reuse/cache some data from Azure Data Factory, you can select an existing interactive cluster or an existing instance pool when creating the Databricks linked service. That way, if you run multiple tasks/jobs in sequence, each subsequent task/job in the run will reuse the same cluster.&lt;/P&gt;
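As a rough sketch of what that looks like in practice (the property names follow the ADF "AzureDatabricks" linked-service schema; the workspace URL, Key Vault reference, cluster ID, and pool ID below are all placeholders):

```python
import json

# Hedged sketch of an ADF "AzureDatabricks" linked service payload.
# Instead of spinning up a new job cluster per activity, it pins an
# existing interactive cluster (or, alternatively, an instance pool)
# so that sequential Databricks activities reuse the same warm compute.
linked_service = {
    "name": "AzureDatabricksReuse",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1111111111111111.1.azuredatabricks.net",
            "accessToken": {
                "type": "AzureKeyVaultSecret",
                "secretName": "databricks-pat",  # placeholder secret name
                "store": {
                    "referenceName": "MyKeyVault",  # placeholder Key Vault
                    "type": "LinkedServiceReference",
                },
            },
            # Option 1: reuse a running interactive (all-purpose) cluster.
            "existingClusterId": "0213-123456-abcdefg1",
            # Option 2 (use instead of existingClusterId): draw VMs from an
            # instance pool so new job clusters skip VM provisioning time:
            # "instancePoolId": "0213-123456-pool-abcdefg1",
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

With existingClusterId set, every Databricks activity that references this linked service runs on the same cluster, so only the first activity in a sequence pays the startup cost.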
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook#:~:text=Azure%20Databricks%20%2D%20to%20connect%20to%20the%20Databricks%20cluster&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Please give a like if this was helpful. Follow-ups are appreciated.&lt;/P&gt;
&lt;P&gt;Kudos,&lt;/P&gt;
&lt;P&gt;Sai Kumar&lt;/P&gt;</description>
      <pubDate>Tue, 13 Feb 2024 12:47:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60049#M31574</guid>
      <dc:creator>saikumar246</dc:creator>
      <dc:date>2024-02-13T12:47:58Z</dc:date>
    </item>
    <item>
      <title>Re: Shared job clusters on Azure Data Factory ADF</title>
      <link>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60305#M31630</link>
      <description>&lt;P&gt;Hi&amp;nbsp;Sai Kumar,&lt;/P&gt;&lt;P&gt;Many thanks for your response.&lt;/P&gt;&lt;P&gt;Unfortunately, using analytical clusters is not really an option for me due to the cost difference between job clusters and analytical clusters.&lt;BR /&gt;Job clusters also offer assurance that the latest deployed version of the code (wheel) file is being picked up.&lt;/P&gt;&lt;P&gt;If shared job clusters are not available, could you share some more details about cluster pools and ways to keep the VMs up?&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/pools" target="_blank"&gt;Create a pool - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;BR /&gt;It would be interesting for me to know&amp;nbsp;the&amp;nbsp;best practices for VMs in pools and any other ways to speed up the startup.&lt;BR /&gt;Are there any plans to introduce serverless Python clusters similar to serverless SQL warehouses?&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 11:36:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/shared-job-clusters-on-azure-data-factory-adf/m-p/60305#M31630</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-02-15T11:36:00Z</dc:date>
    </item>
  </channel>
</rss>

