Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Driver keeps restarting due to high GC pressure despite scaling up memory

minhhung0507
Valued Contributor

I'm running into an issue where my Spark driver keeps pausing and eventually restarting due to excessive garbage collection (GC), even though I’ve already scaled up the cluster memory. Below is an example from the driver logs:

Driver/192.168.231.23 paused the JVM process 74 seconds during the past 120 seconds (61.71%) because of GC. We observed 3 such issue(s) since 2025-06-04T07:34:56.504Z.

Attached are screenshots showing multiple gc_pressure warnings over the past hour, with pauses consistently exceeding 60 seconds out of every 120.

[Screenshots attached]

So far I’ve:

  • Increased driver memory.

  • Monitored memory usage and confirmed it’s not fully consumed, but GC still takes a long time.

My questions:

  1. What are the most common root causes for this kind of sustained GC pressure in Spark drivers?

  2. Are there any specific Spark configurations or GC tuning parameters that could help mitigate this?

  3. Would refactoring code to reduce memory usage on the driver (e.g. avoiding collect(), broadcasting, etc.) be more effective?

  4. Is there a recommended way to identify large object allocations or memory leaks that lead to GC overload?

Any insight or guidance would be greatly appreciated!

Thanks!

Regards,
Hung Nguyen
2 REPLIES

BigRoux
Databricks Employee
Here are some things to consider:
 
The following covers the most common root causes of sustained garbage collection (GC) pressure on Spark drivers, followed by mitigation strategies:
 
Common Root Causes of GC Pressure:
  1. Insufficient JVM Heap Configuration:
    • Spark's on-heap memory management (enabled by default) relies on JVM garbage collection of the execution and storage regions; an improperly sized heap can lead to prolonged GC pauses.
  2. Memory Bottlenecks Due to Driver Workload:
    • Memory-intensive operations, such as collect() and toPandas(), can overload the driver by requiring it to hold large amounts of data (a small illustration follows this list).
  3. Excessive Object Creation:
    • Certain Spark components may create large objects or collections in JVM memory, leading to frequent GC cycles.
  4. Scalability Issues with Shared Clusters:
    • Running multiple notebooks or concurrent jobs on the same cluster can exacerbate memory constraints on the driver.
  5. File Listing Operations:
    • Scanning tables with numerous files (e.g., non-Delta or unoptimized Delta tables) may contribute to frequent GC activity.
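To make the driver-workload point more concrete, here is a minimal PySpark sketch of the pattern to watch for. It assumes a Databricks notebook where spark is already defined; the table and column names are placeholders, not anything from your pipeline:

  # Driver-heavy pattern: pulls the entire table into the driver JVM as a pandas DataFrame
  pdf = spark.table("events").toPandas()
  avg_latency = pdf["latency_ms"].mean()

  # Distributed alternative: the aggregation runs on the executors and only one row reaches the driver
  from pyspark.sql import functions as F
  avg_latency = spark.table("events").agg(F.avg("latency_ms")).first()[0]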
Mitigation Strategies:
  1. Increase Driver Memory:
    • If memory usage exceeds the driver configuration, scale up the driver memory using spark.driver.memory or upgrade the driver node type.
  2. Garbage Collector Tuning:
    • Switch to a more efficient garbage collection algorithm, such as G1GC, by adding -XX:+UseG1GC to spark.driver.extraJavaOptions; G1GC is known to alleviate GC bottlenecks in some cases (see the configuration sketch after this list).
  3. Refactor Memory-Intensive Operations:
    • Avoid operations like collect() and toPandas() on large datasets. Replace them with distributed alternatives, such as Spark transformations or Spark ML instead of scikit-learn (as illustrated above).
  4. Off-Heap Memory Allocation:
    • Enable off-heap memory by setting spark.memory.offHeap.enabled=true and tuning spark.memory.offHeap.size. This keeps that portion of Spark's memory outside the JVM heap, reducing GC interference.
  5. Heap Dump Analysis:
    • Identify memory leaks by enabling heap dumps (via JVM options like -XX:+HeapDumpOnOutOfMemoryError, included in the sketch below) and analyzing them with tools like YourKit or Eclipse MAT to locate large object allocations.
  6. Batch Job Isolation:
    • Avoid running batch jobs on shared interactive clusters. Dedicate separate clusters to these workloads to prevent memory bottlenecks.
  7. Optimized Table Scanning:
    • Enable table optimizations (e.g., the OPTIMIZE command for Delta tables) to reduce the overhead of file listing (a short sketch follows the closing note).
  8. Driver Usage Metrics:
    • Monitor driver memory usage and GC performance using the Spark UI, logs, or metrics tools integrated with Databricks. Look for "Full GC" log entries and high GC-pressure percentages for diagnostic insights.
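As a rough sketch, items 2 and 5 might translate into a single line in the cluster's Spark config (on Databricks, typically under Advanced options > Spark config); the heap dump path is only a placeholder and the values should be adapted to your environment:

  spark.driver.extraJavaOptions -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-heap-dumps

Note that driver JVM options only take effect if they are in place before the driver JVM starts, so they belong in the cluster configuration rather than in a running notebook.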
By applying these strategies, sustained GC pressure on Spark drivers can usually be brought under control.
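And for item 7, a minimal sketch of compacting an unoptimized Delta table from a notebook; the three-part table name and the ZORDER column are placeholders:

  # Compact small files so the driver's file-listing work shrinks accordingly
  spark.sql("OPTIMIZE my_catalog.my_schema.events")

  # Optionally co-locate data on a frequently filtered column at the same time:
  # spark.sql("OPTIMIZE my_catalog.my_schema.events ZORDER BY (event_date)")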
 
Cheers, Lou.

minhhung0507
Valued Contributor

Thank you very much for your detailed analysis and helpful recommendations.

We have reviewed your suggestions, and I’d like to share a quick update:

We have already tried most of the mitigation strategies you mentioned — including increasing driver memory, tuning the garbage collector, refactoring memory-heavy operations, and analyzing driver metrics. However, we have not yet explored off-heap memory allocation (item #4), and we will consider testing this next.
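For reference, these are the off-heap settings we plan to start from when we run that test (the size is just an initial guess that we will tune against the driver node's memory):

  spark.memory.offHeap.enabled true
  spark.memory.offHeap.size 8g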

Also, it's worth noting that this issue started occurring only after we switched from GKE to GCE. Previously, our pipelines were running smoothly without any GC-related performance degradation.

Once again, we appreciate your insights and support.

Kind regards,
Hung

