3 weeks ago
Hey Community, I’m new to this platform and need some guidance.
I’ve set up a job on a basic compute configuration: 8GB RAM, 4 Core CPU, 1 Worker (Standard F4), with DLT runtime 16.4.8. However, my job is running slower than expected. When I checked the cluster metrics, I noticed that the Memory Swap Utilization is peaking, which I suspect might be causing the slowdown.
I’ve attached the chart for reference. Could you please take a look and let me know what the best approach would be? Should I increase the cluster memory, or is there another way to handle this? Also, I’m a bit unclear about what exactly the Memory Swap Utilization chart indicates—would appreciate some clarification on that too.
Thanks in advance!
3 weeks ago
Hi @mkwparth ,
It seems your cluster is too small for your workload. Memory Swap Utilization measures how much memory the operating system is swapping to disk because physical RAM is exhausted. When memory pressure is high (e.g. large joins, shuffles, caching), Spark will also spill shuffle data to disk and evict cached data to disk.
To put it simply: either your workload doesn't fit your cluster because you're processing a lot of data, or your workload is heavy on wide transformations, i.e. a lot of joins and group-bys.
You can try to scale your cluster up a bit. Maybe try an F8 or F16 with at least 16-32 GB of RAM.
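Just as a rough sketch (the node type names are my assumption for Azure F-series; adjust for your cloud and data volume), the clusters section of your DLT pipeline settings could look something like this:
%python
# Rough sketch only -- mirrors the "clusters" section of the DLT pipeline settings JSON.
# Standard_F8s_v2 is an assumption (8 cores / 16 GB); pick a size that fits your workload.
pipeline_clusters = [
    {
        "label": "default",
        "node_type_id": "Standard_F8s_v2",
        "driver_node_type_id": "Standard_F8s_v2",
        "num_workers": 1,   # or configure autoscaling instead of a fixed worker count
    }
]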
3 weeks ago
Hello @mkwparth
Can you share a screenshot of the driver memory utilization metric, please?
Are you caching anything in your code?
You may need to increase the memory if you are running multiple jobs on the same cluster, or decrease the GC collection interval from the default 30 minutes to 10 minutes.
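To be precise, the setting I have in mind is Spark's periodic context-cleaner GC, controlled by spark.cleaner.periodicGC.interval (default 30min). It goes in the cluster's Spark config, not in notebook code. A quick sketch to check what the running cluster uses:
%python
# Assumption: the "GC collection interval" above refers to Spark's periodic ContextCleaner GC.
# It is configured in the cluster's Spark config (Compute > Advanced options > Spark), e.g.:
#   spark.cleaner.periodicGC.interval 10min
# Check what the running cluster is using (falls back to Spark's default of 30min):
print(spark.sparkContext.getConf().get("spark.cleaner.periodicGC.interval", "30min"))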
3 weeks ago
Hey @Khaja_Zaffer ,
I’ve already attached the screenshot above. Is this the metric you were referring to? If not, could you please let me know where I can find the specific metrics you’re asking for?
3 weeks ago
This screenshot tells everything. You don't have enough memory in your cluster, and hence it spills to disk (memory swap utilization is high). You have a really small cluster and your workload needs more memory. Just try a cluster with more memory and your problem will be solved 🙂
3 weeks ago
Hey @Khaja_Zaffer, I’m not using any caching in my code. The cache showing up in the chart is likely from the OS page cache. What do you say?
3 weeks ago
Hello @mkwparth
My friend @szymon_dybczak has already shared the details.
To dig further (unfortunately I only have Databricks Community Edition, which has restrictions on showing metrics):
In the metrics view, on the right side, you can select the Driver dropdown instead of Compute, just to confirm what the driver's usage is.
You can also run:
%sh
free -h                   # overall memory usage, including swap and buff/cache
top -b -n 1 | head -n 20  # single snapshot of the busiest processes
jps -l                    # JVM processes running on the node (driver, executors)
The above are basic system monitoring commands. You can also run:
%python
import os
# Find the PID of the Spark driver JVM (the DriverDaemon process)
pid = os.popen("jps | grep DriverDaemon | awk '{print $1}'").read().strip()

# Heap dump of live objects, written to /tmp on the driver
heap_dump_command = f"jmap -dump:live,format=b,file=/tmp/heapdump_{pid}.hprof {pid}"
print(os.popen(heap_dump_command).read())

# Garbage-collection statistics: heap region sizes, GC counts and times
gc_stats_command = f"jstat -gc {pid}"
print(os.popen(gc_stats_command).read())

# Heap configuration and usage summary
# (note: jmap -heap was removed in JDK 9+; use jhsdb jmap --heap --pid <pid> on newer runtimes)
memory_summary_command = f"jmap -heap {pid}"
print(os.popen(memory_summary_command).read())

# Number of open file handles held by Java processes
open_files_command = "lsof | grep java | wc -l"
print(os.popen(open_files_command).read())
The above gives you a snapshot of the JVM (Java Virtual Machine) memory utilization for your Spark driver, so you will get a much clearer idea of what is actually happening on the driver.
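If it helps, here is a rough sketch that turns the jstat -gc output into a single heap-usage number (this assumes the standard HotSpot column layout, with values reported in KB):
%python
import os

# Assumes the standard HotSpot `jstat -gc` columns (S0C, S1C, EC, OC, ..., FGC, GCT), values in KB
pid = os.popen("jps | grep DriverDaemon | awk '{print $1}'").read().strip()
lines = os.popen(f"jstat -gc {pid}").read().splitlines()
stats = dict(zip(lines[0].split(), (float(v) for v in lines[1].split())))

used = stats["S0U"] + stats["S1U"] + stats["EU"] + stats["OU"]        # used heap (survivor + eden + old)
committed = stats["S0C"] + stats["S1C"] + stats["EC"] + stats["OC"]   # committed heap capacity
print(f"Driver heap: {used/1024:.0f} MB used of {committed/1024:.0f} MB ({100*used/committed:.1f}%)")
print(f"Full GCs: {stats['FGC']:.0f}, total GC time: {stats['GCT']:.1f} s")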
From the images you shared, it's clear that there is no memory to spare. As you know, Spark is a general-purpose in-memory compute engine: it brings data from disk (the data lake) into memory and processes it there.
Since the compute you selected is using almost all of its memory, you have to reconfigure the compute for the amount of data you are processing. If you don't, you will eventually also see DRIVER_NOT_RESPONDING compute events ("Driver is up but is not responsive, likely due to GC").
So, the usual causes are:
The driver instance type is not optimal for the load executed on the driver.
There are memory-intensive operations executed on the driver (see the example after this list).
Many notebooks or jobs are running in parallel on the same cluster. We don't recommend this, because it can cause unexpected behavior.
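As a generic illustration of the memory-intensive-operations point (this is not from your code, just a hypothetical example): calls like collect() or toPandas() pull the whole result into driver memory and are a classic cause of driver GC pressure, while distributed alternatives avoid it.
%python
# Hypothetical example -- table and column names are placeholders, not from the actual job.
df = spark.table("samples.nyctaxi.trips")

# Risky on a small driver: materializes every row in driver memory
# rows = df.collect()

# Safer patterns: preview a few rows, or keep the result distributed
df.limit(10).show()
(df.groupBy("pickup_zip").count()
   .write.mode("overwrite")
   .saveAsTable("main.default.trip_counts"))   # hypothetical target table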
I hope this helps you.