Data Engineering
If two Data Factory pipelines run at the same time, or their execution windows overlap, do they share the same Databricks Spark cluster (if both use the same linked service)? (By job clusters I mean the ones created on the fly, as defined in the linked service.)

irfanaziz
Contributor II

Continuing the above case, does that mean that if I have several (say five) ADF pipelines regularly scheduled at the same time, it is better to use an existing cluster? All of the ADF pipelines would then share the same cluster, and hence the cost would be lower.

1 ACCEPTED SOLUTION

Atanu
Esteemed Contributor

For ADF or job runs we always prefer job clusters, but for streaming you may consider using an interactive cluster. Either way, you need to monitor the cluster load: if the load is high, jobs may slow down or even fail. Data size is also a factor. @nafri A​


4 REPLIES

-werners-
Esteemed Contributor III

ADF pipelines will execute the notebooks as follows:

If you use a dedicated cluster and you run two notebooks simultaneously on it, it will actually run both at the same time.

Mind that dedicated clusters are more expensive than job clusters.

The same can be achieved with notebook workflows, where you call parallel notebooks from within one single notebook (which is the one scheduled in ADF). That way you do not have to use a dedicated cluster and can use a job cluster instead.
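As a minimal sketch of that pattern: on Databricks you would pass `dbutils.notebook.run` (only available inside a notebook) as the runner; the notebook paths below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_parallel(runner, notebook_paths, timeout_seconds=3600, max_workers=4):
    """Run several notebooks concurrently on the *same* cluster.

    `runner` is the function that executes one notebook; on Databricks
    you would pass `dbutils.notebook.run` here. Every child notebook
    runs on the cluster this driver notebook is attached to, so one
    job cluster serves all of them.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            path: pool.submit(runner, path, timeout_seconds)
            for path in notebook_paths
        }
        # Collect each notebook's exit value (raises if one failed).
        return {path: f.result() for path, f in futures.items()}

# On Databricks the call would look like (paths are placeholders):
# results = run_in_parallel(dbutils.notebook.run,
#                           ["/jobs/ingest_orders", "/jobs/ingest_customers"])
```

Note that all parallel notebooks then compete for the same cluster resources, which is exactly the load concern mentioned below.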

The main downside of this is that your cluster may get hammered because of the parallel runs. Not necessarily, but that is definitely a concern.

So you could also opt for a cluster pool, which you can use in ADF. It is not exactly the same as using a single cluster, but workers that one job no longer needs can be used by other jobs, until they time out after x minutes of inactivity.
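For illustration, a job's cluster definition can draw its nodes from a pool instead of naming concrete instance types. A sketch of a Databricks Jobs API `new_cluster` payload, expressed as a Python dict; the pool ID is a made-up placeholder:

```python
# Sketch of a Jobs API `new_cluster` spec backed by an instance pool.
# The instance_pool_id value is a placeholder, not a real pool.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "num_workers": 4,
    # Nodes come from the pre-warmed pool, so idle workers released by
    # one job can be reused by another until the pool's idle timeout.
    "instance_pool_id": "pool-1234-example",
}
```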

So depending on your scenario you can go one way or another.

Personally, I never use dedicated clusters because of the price; I use separate job clusters, notebook workflows, and pools instead.

irfanaziz
Contributor II

Since all the pipelines are orchestrated via ADF, we mostly use dedicated clusters, but the cluster sizes are small. So running multiple notebooks via a single notebook is not an optimal solution in this case.

So I think that if you have several pipelines and each one uses a job cluster, you would end up with a higher cost, since job clusters are created on the fly and are therefore not shared between jobs.

-werners-
Esteemed Contributor III

With notebook workflows you can use one job cluster for several notebooks simultaneously.

Just pay attention to the cluster load.

This is the cheapest option.

Cluster pools are also an option: you can use spot instances to save money, and the pre-warmed nodes cut cluster startup times.

