Data Engineering
If two Data Factory pipelines run at the same time, or their execution windows overlap, do they share the same Databricks Spark cluster (if both use the same linked service)? (By job clusters I mean the ones created on the fly, as defined in the linked service.)

irfanaziz
Contributor II

Continuing the above case, does that mean that if I have several (say five) ADF pipelines regularly scheduled at the same time, it is better to use an existing cluster? All of the ADF pipelines would then share the same cluster, and hence the cost would be lower.

1 ACCEPTED SOLUTION

Atanu
Esteemed Contributor

For ADF or job runs we always prefer job clusters, but for streaming you may consider using an interactive cluster. Either way, you need to monitor the cluster load: if the load is high, jobs may slow down or even fail. Data size is also a factor. @nafri A​


4 REPLIES

-werners-
Esteemed Contributor III

ADF pipelines will execute the notebooks as follows:

If you use a dedicated cluster and you run two notebooks simultaneously on it, it will actually run both at the same time.

Mind that dedicated clusters are more expensive than job clusters.

The same can be achieved with notebook workflows, where you call parallel notebooks from within one single notebook (which is the one scheduled in ADF). That way you do not have to use a dedicated cluster and can use a job cluster instead.
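As a minimal sketch of that pattern: on Databricks you would pass `dbutils.notebook.run` (only available inside a notebook) as the runner; the notebook paths below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_parallel(runner, notebook_paths, timeout_seconds=3600, max_workers=4):
    """Run several notebooks concurrently on the *same* cluster.

    `runner` is the function that executes one notebook; on Databricks
    you would pass `dbutils.notebook.run` here. Every child notebook
    runs on the cluster this driver notebook is attached to, so one
    job cluster serves all of them.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            path: pool.submit(runner, path, timeout_seconds)
            for path in notebook_paths
        }
        # Collect each notebook's exit value (raises if one failed).
        return {path: f.result() for path, f in futures.items()}

# On Databricks the call would look like (paths are placeholders):
# results = run_in_parallel(dbutils.notebook.run,
#                           ["/jobs/ingest_orders", "/jobs/ingest_customers"])
```

Note that all parallel notebooks then compete for the same cluster resources, which is exactly the load concern mentioned below.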

The main downside of this is that your cluster may get hammered because of the parallel runs. Not necessarily, but that is definitely a concern.

So you could also opt for a cluster pool, which you can use in ADF. It is not exactly the same as using a single cluster, but workers that one job no longer needs can be used by other jobs, until they time out after x minutes of inactivity.
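For illustration, a job's cluster definition can draw its nodes from a pool instead of naming concrete instance types. A sketch of a Databricks Jobs API `new_cluster` payload, expressed as a Python dict; the pool ID is a made-up placeholder:

```python
# Sketch of a Jobs API `new_cluster` spec backed by an instance pool.
# The instance_pool_id value is a placeholder, not a real pool.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "num_workers": 4,
    # Nodes come from the pre-warmed pool, so idle workers released by
    # one job can be reused by another until the pool's idle timeout.
    "instance_pool_id": "pool-1234-example",
}
```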

So depending on your scenario you can go one way or another.

Personally, I never use dedicated clusters because of the price; I use separate job clusters, notebook workflows, and pools instead.

irfanaziz
Contributor II

Since all the pipelines are orchestrated via ADF, we mostly use dedicated clusters, but the cluster sizes are small. So running multiple notebooks via a single notebook is not an optimal solution in this case.

So I think that if you have several pipelines and each one uses a job cluster, you would end up with a higher cost, since job clusters are created on the fly and are therefore not shared between jobs.

-werners-
Esteemed Contributor III

With notebook workflows you can use one job cluster for several notebooks simultaneously.

Just pay attention to the cluster load.

This is the cheapest option.

Cluster pools are also an option: you can use spot instances to save money, and the pre-warmed nodes cut cluster startup times.

