There are a couple of related posts here and here.
Seeing a similar issue with a long-running job. Processes are in a "RUNNING" state and the cluster is active, but the stdout log shows the dreaded "GC (Allocation Failure)" messages.
Env:

I've set the following in the Spark config:
.config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
.config("spark.cleaner.periodicGC.interval", "1min")
and have attempted to clear the cache:
spark.catalog.clearCache()
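For context, this is roughly how the session is built (a minimal sketch; the app name and where clearCache() gets called are simplified stand-ins for our actual job):

from pyspark.sql import SparkSession

# Rough sketch of the session setup (app name is a placeholder).
spark = (
    SparkSession.builder
    .appName("long-running-job")
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .config("spark.cleaner.periodicGC.interval", "1min")
    .getOrCreate()
)

# Called between batches of work so old cached blocks don't pile up.
spark.catalog.clearCache()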
Is there anything else I can try? Is it possible to set up an alert on this error so the job gets killed when it enters this state and we aren't burning through resources?
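For the alerting piece, I was imagining something like the watchdog below. This is only a hypothetical sketch: it assumes the driver's stdout is readable on disk at a path like /var/log/spark/driver-stdout.log, that the job runs on YARN (so yarn application -kill works), and the hit threshold is an arbitrary number I made up. None of that is confirmed for our setup.

import re
import subprocess
import sys
import time

LOG_PATH = "/var/log/spark/driver-stdout.log"   # hypothetical log location
APP_ID = sys.argv[1]                            # e.g. the YARN application id
THRESHOLD = 50                                  # arbitrary: hits before giving up
PATTERN = re.compile(r"GC \(Allocation Failure\)")

hits = 0
with open(LOG_PATH) as log:
    log.seek(0, 2)  # start tailing from the end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(5)
            continue
        if PATTERN.search(line):
            hits += 1
            if hits >= THRESHOLD:
                # Kill the stuck job so it stops burning cluster resources.
                subprocess.run(["yarn", "application", "-kill", APP_ID], check=False)
                break

If there's a cleaner way to do this (a SparkListener, a metrics/alerting hook, etc.), I'd prefer that over tailing logs.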

