Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

If two Data Factory pipelines run at the same time, or their execution windows overlap, do they share the same Databricks Spark cluster (assuming both use the same linked service)? (Job clusters are those created on the fly, as defined in the linked service.)

irfanaziz
Contributor II

Continuing the above case: if I have several ADF pipelines, say five, scheduled regularly at the same time, is it better to use an existing cluster? All of the ADF pipelines would then share the same cluster, so the cost would be lower.

1 ACCEPTED SOLUTION

Atanu
Databricks Employee

For ADF or job runs we always prefer job clusters, but for streaming you may consider an interactive cluster. Either way, you need to monitor the cluster load: if the load is high, jobs may slow down or even fail. Data size is also a factor. @nafri A


4 REPLIES

-werners-
Esteemed Contributor III

ADF pipelines will execute the notebooks as follows:

If you use a dedicated cluster and you run 2 notebooks simultaneously on it, it will actually run both at the same time.

Mind that dedicated clusters are more expensive than job clusters.

The same can be achieved with a notebook workflow, where you call parallel notebooks from within one single notebook (which is scheduled in ADF). That way you do not have to use a dedicated cluster and can use a job cluster instead.

The main downside is that your cluster may get hammered by the parallel runs. Not necessarily, but it is definitely a concern.
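The notebook-workflow pattern described above can be sketched as follows. On Databricks, the child notebooks would be launched with `dbutils.notebook.run`; here `run_notebook` is a hypothetical stand-in (and the notebook paths are made up) so the fan-out structure is clear and the sketch runs anywhere.

```python
# Sketch of a driver notebook (scheduled in ADF on a job cluster)
# that fans out to child notebooks in parallel. All children share
# the driver's cluster, so only one cluster is billed -- the
# trade-off is contention on that one cluster.
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path: str, timeout_seconds: int = 3600) -> str:
    # Placeholder for: dbutils.notebook.run(path, timeout_seconds)
    return f"done:{path}"

# Hypothetical notebook paths for illustration.
notebooks = ["/jobs/ingest_a", "/jobs/ingest_b", "/jobs/ingest_c"]

# Run all child notebooks concurrently on the same cluster.
with ThreadPoolExecutor(max_workers=len(notebooks)) as pool:
    results = list(pool.map(run_notebook, notebooks))

print(results)
```

Each `dbutils.notebook.run` call runs in its own thread, so the children execute concurrently on the driver's cluster rather than each spinning up a cluster of its own.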

You could also opt for a cluster pool, which you can use from ADF. It is not exactly the same as sharing a single cluster, but workers that one job no longer needs can be picked up by other jobs, until they time out after x minutes of inactivity.

So depending on your scenario you can go one way or another.

Personally, I never use dedicated clusters because of the price; I use separate job clusters, notebook workflows, and pools instead.

irfanaziz
Contributor II

Since all the pipelines are orchestrated via ADF, we mostly use dedicated clusters, though the sizes are small. So running multiple notebooks via a single notebook is not an optimal solution in this case.

So I think if you have several pipelines and each one uses a job cluster, you would end up with a higher cost, since job clusters are created on the fly and not shared between jobs.

-werners-
Esteemed Contributor III

With notebook workflows you can use one job cluster for several notebooks simultaneously.

Only pay attention to the cluster load.

This is the cheapest option.

Cluster pools are also an option, as you can use spot instances and save on node startup times.
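As a minimal sketch of the pool option: a job-cluster spec (as passed to the Databricks Jobs API, or configured equivalently in an ADF linked service) can reference an instance pool instead of provisioning fresh VMs. The pool id and Spark version below are placeholders, not values from this thread.

```python
# A job-cluster definition that draws its workers from an instance
# pool, so nodes start faster and idle workers can be reused by
# other jobs until the pool's inactivity timeout.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",   # example runtime version
    "num_workers": 4,
    # Hypothetical pool id -- look yours up under Compute > Pools.
    "instance_pool_id": "0701-091358-pool-abcdef12",
}

print(new_cluster)
```

Compared with a plain job cluster, the only change is pointing the spec at `instance_pool_id`; node type and availability (e.g. spot instances) are then governed by the pool's configuration.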

