06-03-2024 08:13 AM
We run a shared cluster for general-purpose ad hoc analytics, which I assume is a relatively common way to keep costs down. However, the technical experience of this cluster's users varies a lot, so we run into situations where a single user can completely bog down the cluster with a bad query or notebook. This usually shows up as a process that pulls too much data onto the leader node, typically into a Pandas DataFrame.
Beyond better training, we would like to put technical controls in place to prevent an individual user from breaking the cluster for all other users. Before we adopted Unity Catalog, our workaround was to run a Python notebook job that called out to `psutil` to identify processes using a lot of memory, and kill those. On a Unity Catalog shared access mode cluster, this no longer works due to the stricter access control model.
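For reference, the watchdog was roughly like this (a minimal sketch of the approach described above; the 20 GB threshold is just an illustrative value):

```python
import psutil

# Hypothetical threshold; tune to the leader node's memory.
MEM_LIMIT_BYTES = 20 * 1024**3

# Scan all processes and kill any whose resident memory exceeds the limit.
for proc in psutil.process_iter(["pid", "name", "memory_info"]):
    try:
        rss = proc.info["memory_info"].rss
        if rss > MEM_LIMIT_BYTES:
            print(f"Killing PID {proc.info['pid']} ({proc.info['name']}), rss={rss}")
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```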
If all else fails, we would probably set up a cron job to kill high-memory PIDs in the same way, but I wanted to check with the community first to see if there's a more elegant solution. Is there some Spark configuration option we can set, or some blessed mechanism for managing resource usage within notebooks?
06-03-2024 12:31 PM
You could try enabling fair scheduling for the SparkContext. In the Databricks workspace, go to Compute, open the cluster, and navigate to Configuration -> Advanced Options -> Spark -> Spark config.
In Spark config for the cluster, add this line:
"spark.scheduler.mode FAIR"
By default, Spark runs jobs with first-in, first-out (FIFO) scheduling: if there's a large job at the head of the queue, later jobs are delayed until it finishes. FAIR scheduling instead gives all running jobs a roughly equal share of cluster resources, which makes it well suited to shared clusters. Source: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
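You can sanity-check that the setting took effect from any notebook on the cluster (a quick sketch, assuming the `spark` session Databricks provides in notebooks):

```python
# Prints "FAIR" if the cluster-level Spark config was applied;
# falls back to Spark's default "FIFO" if the key is unset.
print(spark.sparkContext.getConf().get("spark.scheduler.mode", "FIFO"))
```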
Another thing you can try is assigning individual jobs to scheduler pools. You can assign higher-priority jobs to dedicated pools to ensure they have compute resources available.
Please refer to this Databricks documentation article on scheduler pools, including a code example: https://docs.databricks.com/en/structured-streaming/scheduler-pools.html
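In case the link moves, here's a minimal sketch of what that looks like in a notebook; the pool names (`batch`, `interactive`) are just placeholders:

```python
sc = spark.sparkContext

# Run heavy work in a dedicated pool so it competes fairly for executors...
sc.setLocalProperty("spark.scheduler.pool", "batch")
spark.range(10**8).selectExpr("sum(id)").show()

# ...and lighter ad hoc queries in another pool.
sc.setLocalProperty("spark.scheduler.pool", "interactive")
spark.range(100).count()

# Unset the property so later cells fall back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", None)
```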
For jobs that are likely to be resource hogs, you could schedule them as workflows and configure separate job clusters to handle those workloads. This also saves on Databricks costs, since job compute is cheaper than all-purpose compute.
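If you manage this as code, something like the following sketch with the `databricks-sdk` Python package could create such a job. Exact field names may vary by SDK version, and the job name, notebook path, node type, and runtime version here are all hypothetical examples:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

# A workflow whose heavy notebook runs on its own short-lived job cluster
# instead of competing for resources on the shared all-purpose cluster.
w.jobs.create(
    name="heavy-aggregation",
    tasks=[
        jobs.Task(
            task_key="heavy_agg",
            notebook_task=jobs.NotebookTask(notebook_path="/Shared/heavy_agg"),
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",
                node_type_id="i3.xlarge",
                num_workers=4,
            ),
        )
    ],
)
```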
Lastly, you may want to consider enabling autoscaling for your cluster if you have not done so already. When the cluster resources are maxed out, the cluster can dynamically spin up more workers as needed. However, this will not scale up the resources of your driver node, so it doesn't solve the problem of queries that overutilize the driver.
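For completeness, autoscaling can also be set when creating a cluster through the SDK (again a sketch; names and field details may differ by version), but note it only affects workers:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()
w.clusters.create(
    cluster_name="shared-adhoc",  # hypothetical name
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    # Workers scale between these bounds; the driver stays fixed.
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),
)
```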
06-03-2024 12:34 PM
Thanks for all the ideas. The specific issue I am running into is that individuals who don't know any better are writing notebooks that exhaust the memory on the leader node, so some of these (such as autoscaling) will have no impact on that.
06-03-2024 12:47 PM
Here's another idea: configure a Personal Compute policy and restrict the inexperienced users from attaching to the shared cluster. Then, only grant unrestricted cluster creation permissions to trusted users.
You can override the default Personal Compute settings to make clusters single node, set an auto-termination time, and so on. You can even configure clusters ahead of time for some users.
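As a rough sketch of such a policy, defined here with the `databricks-sdk` Python package (the policy name is hypothetical; the definition keys follow the cluster policy JSON format):

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Forces single-node clusters that auto-terminate after 60 minutes.
w.cluster_policies.create(
    name="restricted-personal-compute",
    definition=json.dumps({
        "spark_conf.spark.databricks.cluster.profile": {"type": "fixed", "value": "singleNode"},
        "spark_conf.spark.master": {"type": "fixed", "value": "local[*]"},
        "num_workers": {"type": "fixed", "value": 0},
        "autotermination_minutes": {"type": "fixed", "value": 60},
    }),
)
```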
This solution does require you to create multiple clusters, however.
06-03-2024 12:50 PM
I appreciate the effort to provide other ideas here, but I am specifically looking to manage the resources consumed by notebooks running on a single shared cluster, so these other suggestions are somewhat of a distraction.
11-13-2024 06:58 PM
Hi, @JameDavi_51481 , were you able to figure something out?
Planning a Databricks migration and realized we might need something similar too.