Databricks Community

wojciech_jakubo · ‎06-21-2023

Hi databricks/spark experts!

I have a piece of pandas-based third-party code that I need to execute as part of a bigger Spark pipeline. By nature, pandas-based code is executed on the driver node. I ran into out-of-memory problems and started exploring how to monitor driver node memory utilization.

My questions are:

1) I have an idle cluster with 56 GB of RAM, and when looking at the new "Metrics" tab I see weird memory fluctuations/cycles. Where do these cycles come from? The cluster is not running any code (CPU utilization ~0%), so I am wondering what's going on.

[Image: Driver memory cycles]

2) My understanding is that the orange "used" series shows memory used by my Python code. But what exactly is the greenish-blue area below it, called "other"? Even when I run my code, the vast majority of memory on the 56 GB driver node is occupied by this "other" stuff:

[Image: Busy cluster]

I don't believe that OS/Docker/Spark/JVM stuff takes 35-40 GB of RAM. So what exactly is it? And how can I reduce it and make more "room" for my code?

3) How does the spark.driver.memory setting affect all that? According to the Spark docs, it is 1g by default. Is this the maximum amount of memory that I can use when running my code (the "used" series)? 1g seems extremely low. Would it make sense to increase it to 8g or 16g for my scenario?

Thx!

Anonymous · ‎06-21-2023

Hi @Wojciech Jakubowski

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.

tapash-db · ‎08-14-2023

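You can always reconfigure that and set it to a larger size.

spark.driver.maxResultSize 4g -- this raises the limit on the total size of serialized results returned to the driver (for example by collect) to 4 GB. It does not by itself allocate driver memory; the driver JVM heap is set by spark.driver.memory.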
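
For reference, a sketch of how the two settings might look in a cluster's Spark config, assuming the usual format of one key and value per line separated by a space. The 16g and 4g values are illustrative only, not recommendations:

```
spark.driver.memory 16g
spark.driver.maxResultSize 4g
```

Changes to the cluster Spark config should take effect only after the cluster restarts, since the driver JVM heap size is fixed when the JVM launches.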

Tharun-Kumar · ‎08-14-2023

@wojciech_jakubo

Regarding the first question: driver memory utilization is high and shows multiple cycles of high utilization. The primary reason is that even when a cluster is idle, the driver has to perform several operations to keep the cluster active and ready for processing. Some of these activities are:

heart beat messages
gc
listening for job requests
hosting spark ui
monitoring resources

These activities run at intervals, which is why memory utilization on the driver appears in cycles.
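
To look at these cycles outside the Metrics tab, one option is to read what the operating system itself reports on the driver. The following is a minimal sketch, assuming the driver runs Linux and exposes /proc/meminfo; the helper names `parse_meminfo` and `summarize` are made up for illustration. It separates memory that is not available from memory held in kernel buffers and page cache, without claiming how the Metrics tab maps these onto its "used" and "other" series.

```python
KIB_PER_GIB = 1024 * 1024  # /proc/meminfo reports sizes in kB (KiB)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are sizes in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Return total, not-available, and cache/buffer memory in GiB."""
    total = values["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = values.get("MemAvailable", values.get("MemFree", 0))
    cache = values.get("Cached", 0) + values.get("Buffers", 0)
    return {
        "total_gib": total / KIB_PER_GIB,
        "not_available_gib": (total - available) / KIB_PER_GIB,
        "cache_gib": cache / KIB_PER_GIB,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            summary = summarize(parse_meminfo(handle.read()))
        for name, value in summary.items():
            print(f"{name}: {value:.1f}")
    except OSError:
        print("/proc/meminfo is not available on this system")
```

Running this in a notebook cell every few seconds while the cluster is idle would show whether the fluctuations come from process memory or from cache.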

wojciech_jakubo · ‎08-16-2023

Hi Tharun-Kumar.

Thanks for your answer. I get that all the activities you listed are required for the cluster to function correctly. But 40 GB of RAM for that? That looks like way too much, imo... especially since all these activities are also done on much smaller drivers that have 8 or 16 GB of RAM...

Tharun-Kumar · ‎08-14-2023

@wojciech_jakubo

Regarding your third question: you can find the actual value of spark.driver.memory on the Executors tab in the Spark UI. The driver is listed there as well, so you can read off its memory.

In the Spark UI we can see only the storage memory. Execution memory will be almost equal to the storage memory.

Screenshot 2023-08-15 at 11.15.31 AM.png

In this case, driver memory would be 21.4 GB. This is the amount of memory allocated to JVM-related activities.
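
For background, the Spark tuning documentation describes a unified memory region shared by execution and storage, sized as (JVM heap - 300 MiB) * spark.memory.fraction (default 0.6), with spark.memory.storageFraction (default 0.5) marking the part of it protected for storage. The sketch below applies that documented formula to a heap size. The function name and the 16 GiB heap are illustrative assumptions and are not taken from the screenshot.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
RESERVED_BYTES = 300 * MIB  # reserved memory, per the Spark tuning docs


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Split a JVM heap into the regions described in the Spark tuning docs.

    memory_fraction mirrors spark.memory.fraction and storage_fraction
    mirrors spark.memory.storageFraction (defaults 0.6 and 0.5).
    """
    usable = max(heap_bytes - RESERVED_BYTES, 0)
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # part protected from eviction
    return {
        "unified": unified,
        "storage": storage,
        "execution": unified - storage,
        "user": usable - unified,           # remainder for user data structures
    }


# Illustrative heap size only (assumption).
regions = unified_memory_regions(16 * GIB)
for name, size in regions.items():
    print(f"{name:>10}: {size / GIB:.2f} GiB")
```

Because storage and execution can borrow from each other inside the unified region, these figures are boundaries under the default settings, not fixed allocations.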

wojciech_jakubo · ‎08-16-2023

Hi,

How did you infer from this image that driver memory would be 21.4 GB? Shouldn't it be 10.4 GB?

Also, if memory is allocated to JVM-related activities, can it also be used from Python? Meaning, if I have 21.4 GB for the JVM, can I use all of that memory from Python (for instance, to load some crazy pandas DataFrames)?

Tharun-Kumar · ‎08-16-2023

Hi @wojciech_jakubo

1. JVM memory is not used for Python-related activities.

2. In the image we can see only the storage memory. There is also execution memory, which would be about the same size. That is how I arrived at a driver memory size of 21.4 GB.
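
Since Python code such as pandas runs outside the JVM (point 1), its footprint can be measured from the Python process itself. A minimal sketch using only the standard library, assuming a Unix-like driver (the `resource` module is not available on Windows); `peak_rss_mib` is a made-up helper name and the byte string is just a stand-in for loading a large DataFrame:

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current Python process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
payload = b"x" * (50 * 1024 * 1024)  # stand-in for a large pandas DataFrame
after = peak_rss_mib()

print(f"peak RSS before: {before:.0f} MiB")
print(f"peak RSS after:  {after:.0f} MiB")
```

The value is a high-water mark, so it never decreases; comparing it before and after the pandas step gives a rough upper bound on what that step needed outside the JVM heap.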