01-21-2025 08:24 AM
When running multiple jobs on the same compute cluster, I see memory utilization creep up over time and seemingly never get fully released, even when jobs finish. This eventually leads to some jobs stalling out as memory hits the upper limit, and to cluster crashes. I have replicated the issue by running a single job in a notebook attached to the cluster, detaching the notebook, then running it again: memory consistently ticks up with each subsequent run. From this test it seems the cluster never fully releases memory after a job (or notebook) finishes or is detached.
Manually deleting objects produced throughout the job (rough sketch below):
Outcome: Objects removed, but memory usage in the cluster did not decrease significantly.

Garbage collection:
Outcome: No noticeable reduction in memory usage.

Notebook detach and reattach:
Outcome: Produces a small decrease in memory utilization, but not a return to the level seen before attaching and running the notebook.

Restarting the cluster:
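For reference, the manual deletion and garbage collection attempts above look roughly like the sketch below. The DataFrame name is just a placeholder for the objects my job actually produces, and spark is the SparkSession that Databricks notebooks provide automatically:

```python
import gc

# `results_df` is a placeholder for the large intermediate objects the job produces;
# `spark` is the SparkSession provided by the Databricks notebook environment.
results_df = spark.range(10_000_000)   # stand-in for a real job output
results_df.cache().count()             # materialize it in cluster memory

# 1. Release any cached/persisted blocks and drop the Python reference
results_df.unpersist()
del results_df

# 2. Clear any remaining cached tables/DataFrames on the cluster
spark.catalog.clearCache()

# 3. Force Python garbage collection on the driver
gc.collect()
```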
I am looking for:
01-22-2025 09:14 AM
Hi @Nate-Haines,
Regularly restarting long-running shared clusters will help release memory on the cluster. The Driver maintains the state information for all notebooks attached to the cluster, as well as the SparkContext; leaving these notebooks attached and idle can prevent the Driver from efficiently releasing memory.
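If you want to automate that, one option is to trigger a periodic restart through the Clusters REST API. This is just a minimal sketch; the workspace URL, token, and cluster ID are placeholders you would fill in:

```python
import requests

# Placeholders: fill in your workspace URL, personal access token, and cluster ID
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Request a cluster restart via the Clusters API (POST /api/2.0/clusters/restart)
resp = requests.post(
    f"{HOST}/api/2.0/clusters/restart",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
print("Restart requested for cluster", CLUSTER_ID)
```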
Hope this helps!
Best,
Miguel
01-23-2025 12:56 AM
Hi @Miguel_Suarez ,
So when I have multiple jobs to run, is it recommended to use a separate cluster for each job instead of one standalone high-concurrency cluster? Does that improve performance?
01-22-2025 01:36 PM
I'm encountering something similar. Immediately upon starting a cluster and triggering a job run, my memory usage jumps from 0 to about 20GB used and 15GB cached (see the attached screenshot). The data I am working with should be very small (less than 1GB), but for some reason the memory usage is still very high and memory is not freed up upon completion of the job.
There are multiple Python tasks in my pipeline, and after each task the memory used seems to remain constant, which to me indicates that memory is not being properly freed even after dataframes and other in-memory objects are no longer in use. My cluster has not been running for long periods of time, so I would not expect it to need a restart. I would like to be able to run multiple concurrent jobs on the same cluster, so it's concerning that even a single task is consuming this much memory.
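A rough way to watch memory between tasks is something like the snippet below. It assumes psutil is available on the driver (it can be %pip installed otherwise) and only covers the driver node, not the executors, but it's enough to see whether usage drops after a task:

```python
import gc
import psutil

def log_driver_memory(label):
    """Print driver-node memory so usage can be compared before/after each task."""
    mem = psutil.virtual_memory()
    print(f"[{label}] used={mem.used / 1e9:.1f} GB, available={mem.available / 1e9:.1f} GB")

log_driver_memory("before task")
# ... run one pipeline task here ...
gc.collect()
log_driver_memory("after task + gc")
```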
Thank you,
Kyle