There is a job that used to run successfully, but for more than a month we have been experiencing long runs that eventually fail. In the stdout log file (attached), there are numerous messages like the following:
[GC (Allocation Failure) [PSYoungGen:...] and [Full GC (System.gc()) [PSYoungGen:...]
It seems I am hitting GC issues that make the job take much longer and then fail every time. In one of the executor logs on the Spark UI > Executors page, I see an error message (ExecLossReason.png) showing: "Executor decommission: worker decommissioned because of kill request from HTTP endpoint (data migration disabled)"
Then I added the following to the Spark config parameters:
spark.databricks.dataMigration.enabled true
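For context, this is roughly what the relevant part of the cluster's Spark config looks like now. The GC-related JVM flags below are illustrative options I am considering (switching to G1GC and suppressing the explicit System.gc() calls that show up as "Full GC (System.gc())" in the log), not settings I have verified:

```
spark.databricks.dataMigration.enabled true
# Illustrative, untested: use G1GC on executors and ignore explicit System.gc() calls
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+DisableExplicitGC
spark.driver.extraJavaOptions -XX:+UseG1GC
```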
I also tried stronger Compute/Worker/Driver instance types, but I still get the same failure message.
Any thoughts? How can I resolve this issue, given that the same pipeline job works correctly in DEV, UAT, and PROD, but fails in QA?