13 hours ago
Hi,
We're running DBR 14.3 on a shared multi-node cluster.
When checking the driver's metrics, I see that Memory utilization and Memory swap utilization keep increasing and almost never decrease, even when no processes are running anymore.
It seems that some processes are allocating memory, but are never releasing it.
Is there a way to detect which processes are allocating memory on the driver node?
Is there a way to detect which processes are causing the memory swap utilization on the driver node? I know these are the result of memory pressure, but it seems the memory is not released even after a node crash and restart due to out-of-memory (OOM).
13 hours ago
The Spark UI provides detailed information about the memory usage of different processes. You can access the Spark UI by navigating to the "Executors" tab, which shows the memory usage of the driver and executors. This can help identify if specific tasks or stages are consuming excessive memory.
13 hours ago
Hi Walter,
This overview shows the consumption by node, not by process.
The thread dump and heap histogram do not seem to provide useful information (for my issue).
13 hours ago
To add to this, in case the Spark UI does not help:
try to SSH into the driver and check in 'top' (or htop if it is installed) which processes use the memory.
Check the RES, VIRT and SWAP columns (and COMMAND to see which program).
GC should free up memory, but perhaps for some reason memory does not get released.
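Once you have a shell on the driver, a minimal sketch of that check (assuming procps-style `ps`/`top`, as on typical Ubuntu-based images):

```shell
# Per-process snapshot, largest resident memory first.
# RSS/VSZ here correspond to the RES/VIRT columns in top.
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 15

# One-shot, non-interactive top sorted by memory usage
top -b -n 1 -o %MEM | head -n 20
```

The batch flags (`-b -n 1`) make top usable in scripts and non-interactive shells.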
13 hours ago
Do I do this from Databricks or from Azure? To do it from Azure, I'm missing the credentials to connect.
From Databricks I don't know how to do this.
12 hours ago
There is something called the 'web terminal' that you can enable in the settings.
This will open a terminal on the driver (I am pretty sure it is the driver and not a worker).
And from there you can run top/htop etc. like on a normal Linux shell.
If you are not comfortable with Linux, you might want to ask someone who is.
12 hours ago
Non-interactive commands (like 'free') can be run from notebooks btw, using the %sh magic command.
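For example, a cell like the following (hypothetical cell contents; in a notebook everything after the %sh magic is a plain shell command):

```shell
# In a Databricks notebook, prefix the cell with the %sh magic.
# Overall memory picture in megabytes (use -h for human-readable units)
free -m

# Largest resident-memory (RSS) processes first
ps aux --sort=-rss | head -n 10
```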
11 hours ago
Found the web terminal doc: Run shell commands in Azure Databricks web terminal - Azure Databricks | Microsoft Learn
Unfortunately, we're running shared clusters on DBR 14.3, so no web terminal support
Running %sh htop from a notebook does not align with the memory usage shown in the Metrics tab.
11 hours ago
htop in a notebook looks kinda wonky, so I would not use that.
'free' gives you a general overview, so with free -h or free -m you can also see some info.
https://www.howtogeek.com/659529/how-to-check-memory-usage-from-the-linux-terminal/
Also, trust OS commands over the metrics. Nothing knows better what's going on on an OS than the OS itself.
But it being a shared interactive cluster: how long has it been since you restarted it? Is it always the same job that gives issues? Are you sure nobody is running anything?
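On the swap question specifically: plain ps/free won't attribute swap to individual processes, but on Linux per-process swap is exposed as VmSwap in /proc/<pid>/status, so a sketch like this could rank the swappers (assumes a reasonably modern kernel):

```shell
# Rank processes by swap usage (kB), largest first.
# Processes may exit mid-scan, hence the stderr suppression.
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f" 2>/dev/null
done | sort -rn | head -n 10
```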
I'd check the same workload on a single user cluster and see what happens. Shared clusters do have some limitations.
10 hours ago
I'll check this out.
My goal is to see which notebooks/processes are consuming large amounts of driver memory (without releasing it), as this might indicate a memory leak or some non-parallel code that needs to be fixed.
10 hours ago
At the OS level you will not see notebooks; you will see the memory consumption of the Spark application as a whole (so that is all notebooks combined).
For that there is the spark ui.
I'd look for collect() and broadcast() statements, Python code outside of Spark, tons of graphics/docs in notebooks (which make the notebook heavy), loops over DataFrame records, etc. It all exists 😞