A persistent "GC (Allocation Failure)" pattern in Spark jobs, where processes remain stuck in the RUNNING state even after you clear the cache and force a GC, typically indicates ongoing memory pressure, data skew, or excessive memory use on the driver or executors. Your GC tuning and cache-clearing efforts are good first steps, but you may need to take broader measures.
Additional Troubleshooting Steps
- Increase executor/driver memory: "Allocation Failure" often points to insufficient memory on either the driver or the executors. Scale memory allocations up incrementally and monitor the logs for improvement (a configuration sketch follows this list).
- Check for data skew and repartition: Analyze whether a small number of tasks is handling an excessive amount of data, then redistribute or increase the number of partitions to avoid overloading single tasks or executors (see the skew-check sketch below).
- Experiment with GC algorithms: Switching from the default Parallel GC to G1GC can balance garbage collection and improve performance, as other users with similar issues have reported (also shown in the configuration sketch below).
- Optimize job logic: Review actions like collect(), show(), or caching large, unneeded datasets, as these often overwhelm driver memory (see the driver-memory example below).
- Avoid shared clusters for heavy jobs: Run long-running or memory-intensive jobs on dedicated clusters to avoid resource contention.
- Use the Spark UI to monitor memory, stage/task distribution, and GC events in detail.
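As a starting point for the memory and GC items, here is a minimal sketch of how those settings can be applied when building a session. The app name, the sizes (8g/2g), and the flags are illustrative placeholders, not recommended values:

```python
from pyspark.sql import SparkSession

# Illustrative sizes and flags only -- tune against what your cluster actually has.
spark = (
    SparkSession.builder
    .appName("gc-pressure-tuning")
    # More heap for executors, plus extra off-heap overhead.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    # Switch executors from the default Parallel GC to G1GC.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)

# Driver-side settings (spark.driver.memory, spark.driver.extraJavaOptions)
# usually need to be passed at submit time (spark-submit or cluster config),
# because the driver JVM is already running by the time this builder executes.
```

After changing these, watch the GC time column on the Spark UI Executors tab to confirm the pressure actually drops rather than just moving elsewhere.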
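For the data-skew item, a quick check is to count rows per key and see whether a handful of keys dominate; `df` and `"join_key"` below are stand-ins for your own DataFrame and skew-prone column:

```python
from pyspark.sql import functions as F

# df and "join_key" are placeholders for your own DataFrame and column.
key_counts = (
    df.groupBy("join_key")
      .count()
      .orderBy(F.desc("count"))
)
key_counts.show(20)   # the heaviest keys float to the top

# If a few keys dominate, spreading the data over more partitions
# (or repartitioning by the key) keeps single tasks from doing most of the work.
df_repartitioned = df.repartition(200, "join_key")
```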
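On the job-logic point, pulling a full dataset back to the driver is a common source of driver-side memory pressure. A rough sketch of the usual substitutions, with `large_df` and the output path as placeholders:

```python
# Avoid materializing an entire large DataFrame on the driver:
# rows = large_df.collect()                # can overwhelm driver memory

# Prefer a bounded preview, or keep the data distributed by writing it out:
preview = large_df.limit(100).toPandas()   # small, driver-safe sample
large_df.write.mode("overwrite").parquet("/tmp/large_df_out")

# Release caches you no longer need so executors can reclaim memory.
large_df.unpersist()
```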
Setting Up Alerts and Auto-Killing Jobs
Spark and YARN do not natively alert on, or kill, jobs that are stuck in a JVM "GC Allocation Failure" state, but you have several workaround strategies:
- Monitor logs externally: Use log-analysis tools (e.g., Papertrail, Splunk) to watch for "GC (Allocation Failure)" patterns and trigger alerts or automated scripts that kill the job.
- Custom SparkListener or YARN/Databricks job monitoring: Implement a script or monitoring tool that detects hung jobs (e.g., a job stuck in RUNNING for X time with no progress) and invokes a REST API or CLI command to kill it (a monitoring sketch follows this list).
- Databricks/cloud environments: Leverage built-in alerting/logging or cluster/job timeout policies to force shutdowns on inactivity or continuous error logs.
- Spark configuration: For executor-level failures, newer Spark releases offer features like spark.excludeOnFailure.killExcludedExecutors to automatically kill problematic executors, but these do not directly target GC failures (see the configuration sketch after this list).
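As a sketch of the external-monitoring idea on YARN, the script below polls the ResourceManager REST API for applications that have been RUNNING longer than a threshold and kills them through the same API. The ResourceManager URL, the two-hour threshold, and using elapsed time as the "no progress" signal are all assumptions to adapt; a real check would also inspect stage/task progress or log patterns such as "GC (Allocation Failure)":

```python
import time
import requests

# Assumed ResourceManager address and threshold -- adjust for your cluster.
RM_URL = "http://resourcemanager:8088"
MAX_RUNTIME_MS = 2 * 60 * 60 * 1000   # flag anything RUNNING longer than 2 hours

def running_apps():
    # List applications currently in the RUNNING state.
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    resp.raise_for_status()
    return (resp.json().get("apps") or {}).get("app", [])

def kill_app(app_id):
    # Ask the ResourceManager to transition the application to KILLED.
    resp = requests.put(
        f"{RM_URL}/ws/v1/cluster/apps/{app_id}/state",
        json={"state": "KILLED"},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    now_ms = int(time.time() * 1000)
    for app in running_apps():
        if now_ms - app["startedTime"] > MAX_RUNTIME_MS:
            print(f"Killing long-running application {app['id']} ({app['name']})")
            kill_app(app["id"])
```

The same pattern can be pointed at the Spark UI's /api/v1/applications endpoints or the Databricks Jobs API instead; on Databricks, a job-level timeout setting gives a similar forced shutdown without any external script.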
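For the last item, a minimal sketch of the exclude-on-failure settings, assuming Spark 3.1+ (earlier releases used the older blacklist names). They must be set at session/submit time and react to repeated task failures on an executor, not to GC pressure itself:

```python
from pyspark.sql import SparkSession

# Illustrative only: exclude executors with repeated task failures and kill them.
spark = (
    SparkSession.builder
    .config("spark.excludeOnFailure.enabled", "true")
    .config("spark.excludeOnFailure.killExcludedExecutors", "true")
    .getOrCreate()
)
```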
Key Configurations and References
| Setting/Approach | Description/Evidence |
|---|---|
| Increase executor/driver memory | See improvements by scaling up resources |
| Switch to G1GC | More balanced GC behavior in production |
| Monitor job progress/logs | Use monitoring/log tools for alerts |
| Use dedicated clusters | Avoids contention issues |
| Repartition & avoid data skew | Distributes tasks to prevent single-task overload |
Implementing a combination of these strategies will help reduce the likelihood of jobs getting stuck in this state and allow for faster remediation if it happens again. For job auto-termination, external monitoring tied to log analysis is the most flexible approach when direct in-job detection isn't available.