06-21-2023 06:25 AM
Hi databricks/spark experts!
I have a piece of pandas-based 3rd-party code that I need to execute as part of a bigger Spark pipeline. By nature, pandas-based code executes on the driver node. I ran into out-of-memory problems and started exploring how to monitor driver node memory utilization.
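For context, this is roughly how I sample driver memory from the notebook while the pandas step runs (a minimal sketch using psutil, which is typically available on Databricks drivers; the sample count and interval are arbitrary):

import time
import psutil  # usually bundled on Databricks driver nodes; otherwise pip install psutil

def log_driver_memory(samples=5, interval_s=10.0):
    """Print total/used/available RAM on the driver node a few times."""
    for _ in range(samples):
        mem = psutil.virtual_memory()
        print(
            f"total={mem.total / 2**30:.1f} GiB  "
            f"used={mem.used / 2**30:.1f} GiB  "
            f"available={mem.available / 2**30:.1f} GiB"
        )
        time.sleep(interval_s)

log_driver_memory()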
My questions are:
1) I have an idle cluster with 56 GB of RAM, and when looking at the new "Metrics" tab I see odd memory fluctuations/cycles. Where do these cycles come from? The cluster is not running any code (CPU utilization ~0%), so I am wondering what's going on?
2) My understanding is that the orange "used" series shows memory used by my Python code. But what exactly is the greenish-blue area below it called "other"? Even when I run my code, the vast majority of memory on the 56 GB driver node is occupied by this "other" stuff:
I don't believe that OS/Docker/Spark/JVM overhead takes 35-40 GB of RAM. So what exactly is it? And how can I reduce it to make more "room" for my code?
3) How does the spark.driver.memory setting affect all of this? According to the Spark docs, it defaults to 1g. Is this the maximum amount of memory I can use when running my code (the "used" series)? 1g seems extremely low. Would it make sense to increase it to 8g or 16g for my scenario?
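For concreteness, the kind of change I have in mind is adding something like this to the cluster's Spark config (Advanced options). The values are illustrative guesses for my 56 GB driver, not a recommendation; Databricks normally derives spark.driver.memory from the node type:

spark.driver.memory 16g
spark.driver.maxResultSize 8g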
Thx!
06-21-2023 10:19 PM
Hi @Wojciech Jakubowski,
Great to meet you, and thanks for your question!
Let's see if your peers in the community have an answer to your question. Thanks.
08-14-2023 02:52 PM
You can always reconfigure that and set it to a higher value, e.g.:
spark.driver.maxResultSize 4g
Note that spark.driver.maxResultSize caps the total size of serialized results collected back to the driver; the driver heap itself is controlled by spark.driver.memory.
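If you are launching Spark yourself (outside Databricks), a sketch of setting both knobs up front; spark.driver.memory only takes effect before the driver JVM starts, so on an existing Databricks cluster it has to go into the cluster's Spark config instead:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # driver heap; only honored if set before the driver JVM starts
    .config("spark.driver.memory", "16g")
    # cap on serialized results collected back to the driver
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)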
08-14-2023 10:36 PM
About the first question: driver memory utilization is high and shows repeated cycles of high utilization. The primary reason is that even when a cluster is idle, the driver has to perform a number of housekeeping operations to keep the cluster active and ready for processing.
These activities run at regular intervals, which is why memory utilization on the driver appears in cycles.
08-16-2023 05:00 AM
Hi Tharun-Kumar.
Thanks for your answer. I get that all the activities you mentioned are required for the cluster to function correctly. But 40 GB of RAM for that? That looks way too much imo... especially since all of these activities also run on much smaller drivers that have 8 or 16 GB of RAM...
08-14-2023 10:46 PM
About your third question: you can find the actual value of spark.driver.memory in the Executors tab of the Spark UI. The driver is listed there as well, so you can read its memory allocation directly.
In the Spark UI you can only see the storage memory; execution memory will be roughly equal to the storage memory.
In this case, the driver memory would be 21.4 GB. This is the amount of memory allocated to JVM-related activities.
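You can also cross-check the configured value from a notebook (a small sketch; it assumes the usual spark session object that Databricks notebooks provide):

# Read the driver-side settings back from the running SparkContext
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory", "1g (default)"))
print(conf.get("spark.driver.maxResultSize", "1g (default)"))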
08-16-2023 05:03 AM
Hi,
How did you infer from this image that driver memory would be 21.4 GB? Shouldn't it be 10.4 GB?
Also, if memory is allocated to JVM-related activities, can it also be used from Python? Meaning, if I have 21.4 GB for the JVM, does that mean I can use all of that memory from Python (for instance, to load some crazy pandas DataFrames)?
08-16-2023 12:59 PM
1. JVM memory will not be used for Python-related activities.
2. In the image we can only see the storage memory. There is also execution memory, which would be roughly the same size; hence my estimate of about 21.4 GB for the driver's JVM memory.
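For what it's worth, the 21.4 GB figure is consistent with Spark's unified memory sizing if the driver JVM heap is around 36 GiB (an assumption; the exact heap Databricks assigns to a 56 GB node is not shown in this thread):

# Rough sanity check using Spark 3.x unified-memory defaults.
heap_mib = 36 * 1024          # assumed driver JVM heap (~36 GiB)
reserved_mib = 300            # Spark's reserved memory
memory_fraction = 0.6         # spark.memory.fraction default
storage_fraction = 0.5        # spark.memory.storageFraction default

unified_mib = (heap_mib - reserved_mib) * memory_fraction
storage_mib = unified_mib * storage_fraction   # storage can borrow from execution

print(f"unified ~= {unified_mib / 1024:.1f} GiB")  # ~= 21.4 GiB
print(f"storage ~= {storage_mib / 1024:.1f} GiB")  # ~= 10.7 GiB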