I am experiencing memory leaks on a Standard (formerly shared) interactive cluster:
1. We run jobs regularly on the cluster
2. After each job completes, driver memory usage continues to increase, suggesting resources aren't fully released
3. Eventually, the cluster crashes with "Out of Memory" errors after roughly half a day of executions
Details:
- Using DBR 15.4
- The same jobs ran for weeks without issues on a "No isolation shared" cluster
- We are using ThreadPoolExecutor to parallelize notebook/function runs
- The memory usage grows consistently after each job execution
- We can see occasional drops in memory usage (so garbage collection is running), but the overall trend is still steadily increasing memory consumption
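For context, our parallelization pattern looks roughly like this (simplified; `run_job` is a stand-in for our actual `dbutils.notebook.run` calls). We do use the executor as a context manager, so worker threads should be shut down after each batch:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_job(job_id):
    # stand-in for dbutils.notebook.run(...) in the real workload
    return f"done-{job_id}"

# context manager ensures pool.shutdown(wait=True) runs after each batch,
# so threads themselves should not accumulate between executions
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_job, i) for i in range(8)]
    results = [f.result() for f in as_completed(futures)]
```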
Questions:
1. Is this a known issue with DBR 15.4 on Standard (formerly shared) interactive clusters?
2. Are there specific settings or configurations that might help prevent this memory leak?
3. Are there recommended methods to monitor and diagnose driver memory usage when direct JVM access is restricted?
Any guidance would be appreciated!
I noticed similar posts:
https://community.databricks.com/t5/data-engineering/restarting-the-cluster-always-running-doesn-t-f...
https://community.databricks.com/t5/data-engineering/drive-memory-utilization-cleanup/td-p/106519