Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Ad hoc workflows - managing resource usage on shared clusters

JameDavi_51481
New Contributor III

We run a shared cluster for general-purpose ad hoc analytics, which I assume is a relatively common way to keep costs down. However, the technical experience of this cluster's users varies a lot, so we run into situations where a single user can completely bog down the cluster with a bad query or notebook. This usually shows up as a process that moves too much data onto the leader node - usually into a Pandas dataframe.

 

Beyond better training, we would like to put technical controls in place to prevent an individual user from breaking the cluster for all the other users. Before we adopted Unity Catalog, our workaround was a Python notebook job that called out to `psutil` to identify processes using a lot of memory and kill them. On a Unity Catalog shared access mode cluster this no longer works, due to the stricter access control model.
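
For context, a minimal sketch of the kind of watchdog we ran (the 2 GiB threshold and the system-user exclusion below are illustrative, not our actual values):

import psutil

# Illustrative threshold: any user process holding more resident memory
# than this gets killed. Pick a limit appropriate for your node size.
MEM_LIMIT_BYTES = 2 * 1024**3  # 2 GiB

for proc in psutil.process_iter(["pid", "name", "username", "memory_info"]):
    try:
        info = proc.info
        # Skip system processes and anything whose memory we couldn't read
        if info["username"] in (None, "root") or info["memory_info"] is None:
            continue
        if info["memory_info"].rss > MEM_LIMIT_BYTES:
            print(f"killing pid={info['pid']} name={info['name']} "
                  f"rss={info['memory_info'].rss / 1024**2:.0f} MiB")
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        # Process exited or is off-limits; ignore and move on
        continue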

 

If all else fails, we would probably set up a cron job to kill high-memory PIDs in the same way, but I wanted to check with the community first to see if there's a more elegant solution. Is there a Spark configuration option we can set, or some blessed mechanism for managing resource usage within notebooks?

4 REPLIES

xorbix_rshiva
Contributor

You could try enabling fair scheduling for the SparkContext. In the Databricks workspace, go to Compute, open the cluster, and go to Configuration -> Advanced Options -> Spark -> Spark config.

In the Spark config for the cluster, add this line:
spark.scheduler.mode FAIR

By default, Spark runs jobs in first-in, first-out (FIFO) order, so a large job at the head of the queue delays every job behind it until it finishes. Fair scheduling instead gives all running jobs a roughly equal share of cluster resources, which makes it a better fit for shared clusters. Source: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

Another thing you can try is assigning individual jobs to scheduler pools. You can assign higher priority jobs to dedicated pools to ensure they will have compute resources available.

Please refer to this Databricks documentation article on scheduler pools, including a code example: https://docs.databricks.com/en/structured-streaming/scheduler-pools.html
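The pattern from that doc amounts to setting a local property on the SparkContext before running work in a notebook; roughly like this (the pool name "adhoc_pool" is arbitrary, and `spark` is the SparkSession that Databricks notebooks provide):

# Route Spark jobs triggered from this notebook into a named pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "adhoc_pool")

# Anything executed after this point competes within "adhoc_pool"
spark.range(10**7).count()

# Passing None clears the property, returning jobs to the default pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)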

For jobs that are likely to be resource hogs, you could schedule them as Workflows and configure separate job clusters to handle those workloads (see the sketch below). This also saves on Databricks costs, since job compute is cheaper than all-purpose compute.
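
As a rough sketch using the Databricks Python SDK (the job name, notebook path, node type, and runtime version here are placeholders; adjust for your workspace):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()  # reads auth from the environment / .databrickscfg

w.jobs.create(
    name="heavy-adhoc-report",  # placeholder name
    tasks=[
        jobs.Task(
            task_key="run_report",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/someone@example.com/heavy_report"
            ),
            # Dedicated job cluster: cheaper than all-purpose compute, and
            # a runaway query here can't take down the shared cluster.
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",
                node_type_id="i3.xlarge",
                autoscale=compute.AutoScale(min_workers=1, max_workers=4),
            ),
        )
    ],
)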

Lastly, you may want to consider enabling autoscaling for your cluster if you have not done so already. When cluster resources are maxed out, the cluster can dynamically spin up more workers as needed. However, autoscaling does not scale up the driver node, so it doesn't solve the problem of queries that overload the driver.

JameDavi_51481
New Contributor III

Thanks for all the ideas - the specific issue I am running into is that individuals who don't know any better write notebooks that exhaust the memory on the leader node, so some of these suggestions (such as autoscaling) will have no impact on that.

xorbix_rshiva
Contributor

Here's another idea: configure a Personal Compute policy to restrict inexperienced users from attaching to the shared cluster. Then grant unrestricted cluster creation permissions only to trusted users.

You can override the default Personal Compute policy settings to force clusters to be single node, set an auto-termination time, and so on (see the sketch below). You can even configure clusters ahead of time for some users.

This solution does require you to create multiple clusters, however.
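
If you go that route, cluster policies can also be created with the Python SDK. A rough sketch (the policy name and the single-node/auto-termination values are illustrative; attribute paths follow the cluster policy definition language in the Databricks docs):

import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Illustrative policy: pin clusters to single node and force a
# 60-minute auto-termination.
definition = {
    "spark_conf.spark.databricks.cluster.profile": {
        "type": "fixed", "value": "singleNode", "hidden": True,
    },
    "spark_conf.spark.master": {
        "type": "fixed", "value": "local[*]", "hidden": True,
    },
    "num_workers": {"type": "fixed", "value": 0, "hidden": True},
    "autotermination_minutes": {"type": "fixed", "value": 60},
}

w.cluster_policies.create(
    name="restricted-personal-compute",  # placeholder name
    definition=json.dumps(definition),
)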

JameDavi_51481
New Contributor III

I appreciate the effort to provide other ideas here, but I am specifically looking to manage the resources consumed by notebooks running on a single shared cluster, so these other suggestions are somewhat of a distraction.
