03-21-2022 11:54 AM
Hi, I have seen it written in the documentation that a standard cluster is recommended for a single user. But why? What is meant by that? One of my colleagues and I were testing it on the same notebook, and both of us could use the same standard all-purpose cluster in the same notebook at the same time. The only limitation was that we could not execute the same cell at the same time, which seems reasonably normal.
But if two people can use the same standard all-purpose cluster in the same notebook at the same time, why is it recommended for a single user? Does that mean we should select a high concurrency cluster when multiple people are collaborating in the same notebook at the same time for simple data read and write experiments?
03-21-2022 12:11 PM
A high concurrency cluster just splits resources between users more evenly. So when 4 people run notebooks at the same time on a cluster with 4 CPUs, you can imagine that each person gets 1 CPU.
On a standard cluster, 1 person can utilize all worker CPUs, because a job has multiple partitions (for example 4) and therefore needs multiple cores (1 CPU processes 1 partition at a time, so all 4 CPUs will be busy processing the 4 partitions). Other users' jobs then wait in a queue until your job is finished.
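You can see this relationship directly in a notebook; here is a quick PySpark sketch (the numbers depend entirely on your cluster size and on the DataFrame you create):
# total worker cores Spark can use at once
print(spark.sparkContext.defaultParallelism)
# number of partitions in a DataFrame = number of tasks that will compete for those cores
df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())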
On a standard cluster you can also manage resource allocation at the notebook level using scheduler pools. To do that, set the scheduler pool property on the SparkContext in the first line of the notebook:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
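For example, here is a minimal sketch of how two notebooks on the same standard cluster could use separate pools (this assumes the cluster's Spark config sets spark.scheduler.mode to FAIR; the pool names and toy queries below are only illustrative):
# Notebook run by user 1: tag its jobs with its own pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
spark.range(0, 50_000_000).selectExpr("id % 10 AS bucket").groupBy("bucket").count().show()
# Notebook run by user 2: a different pool, so the fair scheduler shares
# executor cores between the two notebooks instead of queueing one behind the other
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
spark.range(0, 50_000_000).count()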
03-21-2022 01:06 PM
Thank you so much for your reply. So I think it is more related to how the load is handled, not how many users are using the cluster.