03-21-2022 11:54 AM
Hi, I have seen it written in the documentation that a standard cluster is recommended for a single user. But why? What is meant by that? One of my colleagues and I were testing it on the same notebook, and both of us could use the same standard all-purpose cluster in the same notebook at the same time. The only limitation was that we could not execute the same cell at the same time, which seems reasonably normal.
But if two people can use the same standard all-purpose cluster in the same notebook at the same time, why is it recommended for a single user? Does that mean we should select a high concurrency cluster when multiple people are collaborating in the same notebook at the same time for simple data read and write experiments?
03-21-2022 12:11 PM
A high concurrency cluster just splits resources between users more evenly. So when 4 people run notebooks at the same time on a cluster with 4 CPUs, you can imagine that each person gets 1 CPU.
On a standard cluster, 1 person can utilize all worker CPUs, because a job has multiple partitions (for example 4) and therefore needs multiple cores (1 CPU processes 1 partition at a time, so all 4 CPUs will be busy processing the 4 partitions). Other users' jobs then wait in a queue until your job is finished.
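You can see this relationship directly in a notebook; here is a quick PySpark sketch (the numbers depend entirely on your cluster size and on the DataFrame you create):
# total worker cores Spark can use at once
print(spark.sparkContext.defaultParallelism)
# number of partitions in a DataFrame = number of tasks that will compete for those cores
df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())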
On a standard cluster you can also manage resource allocation at the notebook level using scheduler pools. To do that, set the scheduler pool property on the SparkContext in the first line of the notebook:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
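For example, here is a minimal sketch of how two notebooks on the same standard cluster could use separate pools (this assumes the cluster's Spark config sets spark.scheduler.mode to FAIR; the pool names and toy queries below are only illustrative):
# Notebook run by user 1: tag its jobs with its own pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
spark.range(0, 50_000_000).selectExpr("id % 10 AS bucket").groupBy("bucket").count().show()
# Notebook run by user 2: a different pool, so the fair scheduler shares
# executor cores between the two notebooks instead of queueing one behind the other
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
spark.range(0, 50_000_000).count()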
03-21-2022 01:06 PM
Thank you so much for your reply. So I think it is more related to how the load is handled, not how many users are using the cluster.