13 hours ago
Hi,
We're running DBR 14.3 on a shared multi-node cluster.
When checking the driver's metrics, I see that Memory utilization and Memory swap utilization keep increasing and almost never decrease, even when no processes are running anymore.
It seems that some processes are allocating memory, but are never releasing it.
Is there a way to detect which processes are allocating memory on the driver node?
Is there a way to detect which processes are causing the memory swap utilization on the driver node? I know these are the result of memory pressure, but it seems the memory is not released even after a node crash and restart due to out-of-memory (OOM).
13 hours ago
The Spark UI provides detailed information about the memory usage of different processes. You can access the Spark UI by navigating to the "Executors" tab, which shows the memory usage of the driver and executors. This can help identify if specific tasks or stages are consuming excessive memory.
13 hours ago
Hi Walter,
This overview shows the consumption by node, not by process.
The thread dump and heap histogram do not seem to provide useful information (for my issue).
13 hours ago
To add to this, in case the Spark UI does not help:
try to SSH into the driver and check in 'top' (or htop if it is installed) which processes use the memory.
Check the RES, VIRT and SWAP columns (and COMMAND to see which program).
GC should free up memory, but perhaps for some reason memory does not get released.
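Once you have a shell on the driver, a minimal sketch of that check (assuming procps-style `ps`/`top`, as on typical Ubuntu-based images):

```shell
# Per-process snapshot, largest resident memory first.
# RSS/VSZ here correspond to the RES/VIRT columns in top.
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 15

# One-shot, non-interactive top sorted by memory usage
top -b -n 1 -o %MEM | head -n 20
```

The batch flags (`-b -n 1`) make top usable in scripts and non-interactive shells.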
13 hours ago
Do I do this from Databricks or from Azure? To do it from Azure, I'm missing the credentials to connect.
From Databricks I don't know how to do this.
12 hours ago
There is something called the 'web terminal' that you can enable in the settings.
This will open a terminal on the driver (I am pretty sure it is the driver and not a worker).
And from there you can run top/htop etc. like on a normal Linux shell.
If you are not comfortable with Linux, you might want to ask someone who is.
12 hours ago
Non-interactive commands (like 'free') can be run from notebooks btw, using the %sh magic command.
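For example, a cell like the following (hypothetical cell contents; in a notebook everything after the %sh magic is a plain shell command):

```shell
# In a Databricks notebook, prefix the cell with the %sh magic.
# Overall memory picture in megabytes (use -h for human-readable units)
free -m

# Largest resident-memory (RSS) processes first
ps aux --sort=-rss | head -n 10
```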
11 hours ago
Found the web terminal doc: Run shell commands in Azure Databricks web terminal - Azure Databricks | Microsoft Learn
Unfortunately, we're running shared clusters on DBR 14.3, so no web terminal support
Running %sh htop from a notebook does not align with the memory usage shown in the Metrics tab.
11 hours ago
htop in a notebook looks kinda wonky, so I would not use that.
'free' gives you a general overview, so with free -h or free -m you can also see some info.
https://www.howtogeek.com/659529/how-to-check-memory-usage-from-the-linux-terminal/
Also, trust OS commands over the metrics. Nothing knows better what's going on on an OS than the OS itself.
But it being a shared interactive cluster: how long has it been since you restarted it? Is it always the same job that gives issues? Are you sure nobody is running anything?
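On the swap question specifically: plain ps/free won't attribute swap to individual processes, but on Linux per-process swap is exposed as VmSwap in /proc/<pid>/status, so a sketch like this could rank the swappers (assumes a reasonably modern kernel):

```shell
# Rank processes by swap usage (kB), largest first.
# Processes may exit mid-scan, hence the stderr suppression.
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f" 2>/dev/null
done | sort -rn | head -n 10
```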
I'd check the same workload on a single user cluster and see what happens. Shared clusters do have some limitations.
10 hours ago
I'll check this out.
My goal is to see which notebooks/processes are consuming large amounts of driver memory (without releasing it), as this might indicate a memory leak or some non-parallel code that needs to be fixed.
10 hours ago
At the OS level you will not see notebooks; you will see the memory consumption of the Spark application as a whole (so that is all notebooks combined).
For that there is the spark ui.
I'd look for collect() and broadcast() statements, Python code outside of Spark, tons of graphics/docs in notebooks (which make the notebook heavy), loops over DataFrame records, etc. It all exists 😞