Databricks

Tahseen0354 · 03-21-2022

Hi, the documentation says that a Standard cluster is recommended for a single user. But why? What is meant by that? A colleague and I tested it on the same notebook. Both of us could use the same Standard all-purpose cluster in the same notebook at the same time. We just could not execute the same cell at the same time, which is to be expected.

But if two people can use the same Standard all-purpose cluster in the same notebook at the same time, why is it recommended for a single user? Does that mean we should select a High Concurrency cluster when multiple people collaborate in the same notebook at the same time for simple data read and write experiments?

Hubert-Dudek · 03-21-2022

A High Concurrency cluster just splits resources between users more evenly. So when 4 people run notebooks at the same time on a cluster with 4 CPUs, you can imagine that each of them gets 1 CPU.

On a Standard cluster, one person can use all of the worker CPUs. If your job has multiple partitions (for example 4), it needs multiple cores: 1 CPU processes 1 partition at a time, so all 4 CPUs are busy processing the 4 partitions, and other users' jobs wait in the queue until your job is finished.

On a Standard cluster you can also manage resource allocation at the notebook level using scheduler pools. To do that, set this sparkContext property in the first line of the notebook:
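To make the difference concrete, here is a toy model in plain Python. It is only a sketch and not Spark's actual scheduler: it assumes every task (partition) takes one time step, and the names `simulate`, `alice` and `bob` are made up for illustration. With `fair=False` the first job grabs every free core, as in the Standard cluster description; with `fair=True` cores are handed out round-robin between active jobs, as in the High Concurrency description.

```python
def simulate(jobs, cores, fair):
    """Return {job_name: step_at_which_it_finishes}.

    jobs  -- dict of job name -> number of equal-length tasks, in submission order
    cores -- number of tasks that can run in one time step
    fair  -- True: share cores round-robin; False: first job in the queue goes first
    """
    remaining = dict(jobs)
    finish = {}
    step = 0
    while remaining:
        step += 1
        free = cores
        if fair:
            # Hand out one core at a time to each job that still has tasks.
            pending = dict(remaining)
            while free > 0 and any(pending.values()):
                for name in pending:
                    if free == 0:
                        break
                    if pending[name] > 0:
                        pending[name] -= 1
                        free -= 1
            scheduled = {n: remaining[n] - pending[n] for n in remaining}
        else:
            # The earliest submitted job takes as many cores as it can use.
            scheduled = {}
            for name, left in remaining.items():
                take = min(left, free)
                scheduled[name] = take
                free -= take
        for name, ran in scheduled.items():
            remaining[name] -= ran
            if remaining[name] == 0:
                finish[name] = step
                del remaining[name]
    return finish


# alice submits 8 partitions first, bob submits 4 partitions right after, on 4 cores.
fifo_result = simulate({"alice": 8, "bob": 4}, cores=4, fair=False)
fair_result = simulate({"alice": 8, "bob": 4}, cores=4, fair=True)
print("first come, first served:", fifo_result)
print("fair sharing:", fair_result)
```

In this model bob's small job waits behind alice's job in the first case and finishes earlier in the second, while the total amount of work done per step stays the same.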

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
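If you only want some cells to run in a given pool, one option is to wrap the call in a small helper that restores the default afterwards. This is a minimal sketch: the helper name `scheduler_pool` is hypothetical, `spark` is assumed to be the session the notebook already provides, and it assumes that passing `None` clears the property (the Python counterpart of passing `null` in Scala). Pools only have an effect when the fair scheduler is enabled.

```python
from contextlib import contextmanager


@contextmanager
def scheduler_pool(spark, pool_name):
    """Route Spark jobs started by this thread to `pool_name`, then reset the property."""
    sc = spark.sparkContext
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    try:
        yield
    finally:
        # Assumption: None clears the thread-local property, so later jobs
        # go back to the default pool.
        sc.setLocalProperty("spark.scheduler.pool", None)


# Usage in a notebook cell (not executed here because it needs a live session):
# with scheduler_pool(spark, "pool1"):
#     spark.range(1_000_000).count()
```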

Atanu · 03-21-2022

Standard clusters are ideal for processing large amounts of data with Apache Spark. We recommend a Standard cluster for a single user because it is meant to handle less load than a High Concurrency cluster.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.
You can go through these best practices, which may help: https://docs.databricks.com/clusters/cluster-config-best-practices.html

Tahseen0354 · 03-21-2022

Thank you so much for your reply. So I think it is more related to how the load is handled, not how many users are using the cluster.

Tahseen0354 · 03-21-2022

Thank you so much for your reply. Now it makes more sense.

A Standard cluster is recommended for a single user - what is meant by that ?
