08-02-2024 02:25 PM
A job that used to run successfully has, for more than a month, been running for a long time and then failing. The stdout log file (attached) contains many messages like the following:
[GC (Allocation Failure) [PSYoungGen:...] and [Full GC (System.gc()) [PSYoungGen:...]
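To quantify how often each kind of message appears, the stdout log can be tallied by GC event type. The following is only a minimal sketch: it assumes the log is plain text and that entries start the way the two messages quoted above do. The function name `count_gc_events` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a GC entry as quoted in the post, e.g.
# "[GC (Allocation Failure) [" or "[Full GC (System.gc()) [".
GC_EVENT = re.compile(r"\[(Full GC|GC) \((.+?)\) \[")


def count_gc_events(lines):
    """Count GC entries per (kind, cause) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for kind, cause in GC_EVENT.findall(line):
            counts[f"{kind} ({cause})"] += 1
    return counts


# Hypothetical sample lines; a real run would iterate over the stdout log file.
sample = [
    "[GC (Allocation Failure) [PSYoungGen:...]",
    "[GC (Allocation Failure) [PSYoungGen:...]",
    "[Full GC (System.gc()) [PSYoungGen:...]",
    "an unrelated log line",
]
for event, n in count_gc_events(sample).most_common():
    print(event, n)
```

A steadily growing share of `Full GC` entries relative to young-generation collections would be one hint that the heap is under pressure.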
It seems GC issues are making the job run longer, and then it fails every time. In one executor's log, on the Executors page of the Spark UI, I see an error message (ExecLossReason.png): "Executor decommission: worker decommissioned because of kill request from HTTP endpoint (data migration disabled)"
I then added the following to the Spark config parameters:
spark.databricks.dataMigration.enabled true
I also tried stronger compute (worker and driver types), but I still get the same failure message.
Any thoughts? How can I resolve this issue, given that the pipeline job works correctly in DEV, UAT, and PROD but fails in QA?
08-21-2024 05:53 PM
Hi @Retired_mod
Your advice worked well: I got rid of the [GC (Allocation Failure) [PSYoungGen:...] messages entirely, and after picking stronger driver/worker types, the issue in production went away.
I understand that the default GC setting was Parallel GC. After configuring G1GC, I see more balanced GC behavior, and the driver and workers are running somewhat more efficiently.
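The exact settings are not shown in this thread. As a rough sketch, one common way to switch the driver and executor JVMs to G1GC is through the standard Spark options below, added to the cluster's Spark config; treat the values as an assumption rather than the configuration actually used here.

```
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC
```

These are JVM startup options, so they only take effect after the cluster restarts.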
Thanks again,
Shahab
a week ago
Hi @shahabm, I'm facing exactly the same issue, and increasing the driver type or the number of workers isn't helping either. Could you please explain how you resolved it? I don't see the comment or post containing the advice you received. This problem is causing significant delays and escalations in delivery. I'd appreciate your timely guidance.
Thanks in advance!
Regards,
Sid
a week ago
Hi Sid,
This is the list of action items that helped me resolve the issue:
With these actions, the first step of my workflow ran, but the next step hit a separate file system issue: it could not find some of the Delta tables or the locations they were stored in. I was able to resolve that issue as well, which let me complete the job.
a week ago
Thanks a lot @shahabm for your prompt response; I appreciate it. I'll try debugging in this direction.
Thanks again!