01-21-2025 08:24 AM
When running multiple jobs on the same compute cluster, I see memory utilization creep up over time and seemingly never get fully released, even when jobs finish. This eventually leads to some jobs stalling out as memory hits the upper limit, and to cluster crashes. I have replicated the issue by running a single job in a notebook attached to the cluster, detaching the notebook, then running it again: memory consistently ticks up with each subsequent run. From this test it seems the cluster never fully releases memory after a job (or notebook) finishes or is detached.
Manually deleting objects produced throughout the job (rough sketch below):
Outcome: Objects removed, but memory usage in the cluster did not decrease significantly.

Garbage collection:
Outcome: No noticeable reduction in memory usage.

Notebook detach and reattach:
Outcome: Produces a small decrease in memory utilization, but not a return to the level seen before attaching and running the notebook.

Restarting the cluster:
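For reference, the manual deletion and garbage collection attempts above look roughly like the sketch below. The DataFrame name is just a placeholder for the objects my job actually produces, and spark is the SparkSession that Databricks notebooks provide automatically:

```python
import gc

# `results_df` is a placeholder for the large intermediate objects the job produces;
# `spark` is the SparkSession provided by the Databricks notebook environment.
results_df = spark.range(10_000_000)   # stand-in for a real job output
results_df.cache().count()             # materialize it in cluster memory

# 1. Release any cached/persisted blocks and drop the Python reference
results_df.unpersist()
del results_df

# 2. Clear any remaining cached tables/DataFrames on the cluster
spark.catalog.clearCache()

# 3. Force Python garbage collection on the driver
gc.collect()
```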
I am looking for:
01-22-2025 09:14 AM
Hi @Nate-Haines,
Regularly restarting long-running shared clusters will help release memory on the cluster. The Driver maintains the state information for all notebooks attached to the cluster, as well as the SparkContext; leaving these notebooks attached and idle can prevent the Driver from efficiently releasing memory.
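If you want to automate that, one option is to trigger a periodic restart through the Clusters REST API. This is just a minimal sketch; the workspace URL, token, and cluster ID are placeholders you would fill in:

```python
import requests

# Placeholders: fill in your workspace URL, personal access token, and cluster ID
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Request a cluster restart via the Clusters API (POST /api/2.0/clusters/restart)
resp = requests.post(
    f"{HOST}/api/2.0/clusters/restart",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
print("Restart requested for cluster", CLUSTER_ID)
```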
Hope this helps!
Best,
Miguel
01-23-2025 12:56 AM
Hi @Miguel_Suarez ,
So when I have multiple jobs to run, is it recommended to use a separate cluster for each job instead of one standalone high-concurrency cluster? Does that improve performance?
01-22-2025 01:36 PM
I'm encountering something similar. Immediately upon starting a cluster and triggering a job run, my memory usage jumps from 0 to about 20GB used and 15GB cached (see the attached screenshot). The data I am working with should be very small (less than 1GB), but for some reason the memory usage is still very high and memory is not freed up upon completion of the job.
There are multiple Python tasks in my pipeline, and after each task the memory used seems to remain constant, which to me indicates that memory is not being properly freed even after dataframes and other in-memory objects are no longer in use. My cluster has not been running for long periods of time, so I would not expect it to need a restart. I would like to be able to run multiple concurrent jobs on the same cluster, so it's concerning that even a single task is consuming this much memory.
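A rough way to watch memory between tasks is something like the snippet below. It assumes psutil is available on the driver (it can be %pip installed otherwise) and only covers the driver node, not the executors, but it's enough to see whether usage drops after a task:

```python
import gc
import psutil

def log_driver_memory(label):
    """Print driver-node memory so usage can be compared before/after each task."""
    mem = psutil.virtual_memory()
    print(f"[{label}] used={mem.used / 1e9:.1f} GB, available={mem.available / 1e9:.1f} GB")

log_driver_memory("before task")
# ... run one pipeline task here ...
gc.collect()
log_driver_memory("after task + gc")
```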
Thank you,
Kyle