Data Engineering

Databricks job keeps failing due to GC issue

shahabm
New Contributor II

We have a job that used to run successfully, but for more than a month it has been running for a very long time and then failing. The stdout log file (attached) contains numerous messages like the following:

[GC (Allocation Failure) [PSYoungGen:...]    and   [Full GC (System.gc()) [PSYoungGen:...]

It seems I am hitting GC issues that make the job run longer and longer until it fails, every time. In one of the executor logs on the Spark UI Executors page, I see an error message (ExecLossReason.png): "Executor decommission: worker decommissioned because of kill request from HTTP endpoint (data migration disabled)"

I then added the following to the Spark config parameters:

spark.databricks.dataMigration.enabled true

I tried stronger compute (worker/driver) types, but I still get the same failure message.

Any thoughts? How can I resolve this issue, given that the pipeline job works correctly in DEV and UAT, up to PROD, but fails in QA?

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @shahabm, to resolve this, try increasing executor memory, enabling off-heap memory, and experimenting with the G1GC garbage collector. Check for data skew and optimize partitioning to balance the load, make sure the cluster has adequate resources so that executors are not decommissioned, and verify that the job configuration is consistent across environments. Enable and analyze GC logs for insights, and tune resource allocation accordingly. If you use data migration settings, make sure they fit your workload.
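For readers who want to try the G1GC and off-heap suggestions, here is a minimal sketch of what the cluster's Spark config could look like. It assumes the standard Apache Spark keys `spark.executor.extraJavaOptions`, `spark.driver.extraJavaOptions`, `spark.memory.offHeap.enabled`, and `spark.memory.offHeap.size`; the `4g` value and the helper name `render_spark_conf` are illustrative assumptions, not values from this thread. The script only renders the settings as one `key value` pair per line, the same form as the `spark.databricks.dataMigration.enabled true` line in the question.

```python
# Sketch only: candidate settings for the suggestions above.
# The off-heap size is a hypothetical value; size it to the node type in use.
gc_tuning_conf = {
    # Switch executors and driver to the G1 collector.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    "spark.driver.extraJavaOptions": "-XX:+UseG1GC",
    # Off-heap memory must be enabled AND given a positive size.
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "4g",
}


def render_spark_conf(conf):
    """Render settings as 'key value' lines, one setting per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render_spark_conf(gc_tuning_conf))
```

The printed lines can be pasted into the cluster's Spark config field. Executor memory itself is not set here, since the follow-up in this thread addressed it by choosing larger driver/worker types.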

shahabm
New Contributor II

Hi @Kaniz_Fatma 

Your advice worked well: I got rid of the [GC (Allocation Failure) [PSYoungGen:...] messages entirely, and after also picking stronger driver/worker types, the issue in production went away.

I understand that the default GC setting was Parallel GC. After configuring G1GC, I see more balanced GC behavior, and the driver and workers are also working more efficiently, to some extent.

Thanks again,

Shahab
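A before/after comparison like the one described in this reply can be checked by tallying GC events in the executor stdout log. The sketch below is an assumption-laden illustration: the regular expression only targets Parallel GC style lines shaped like the two messages quoted in the question, the sample input reuses those truncated messages verbatim, and the names `GC_EVENT` and `tally_gc_events` are hypothetical. G1GC writes differently shaped log lines, so the pattern would need adjusting for logs taken after the switch.

```python
import re
from collections import Counter

# Matches lines such as "[GC (Allocation Failure) [PSYoungGen:...]" and
# "[Full GC (System.gc()) [PSYoungGen:...]"; captures the kind and the cause.
GC_EVENT = re.compile(r"\[(Full GC|GC) \((.+?)\) \[")


def tally_gc_events(lines):
    """Count GC events per (kind, cause) pair across the given log lines."""
    counts = Counter()
    for line in lines:
        for kind, cause in GC_EVENT.findall(line):
            counts[(kind, cause)] += 1
    return counts


if __name__ == "__main__":
    # Truncated messages copied from the question, not a real log excerpt.
    sample = [
        "[GC (Allocation Failure) [PSYoungGen:...]",
        "[Full GC (System.gc()) [PSYoungGen:...]",
        "[GC (Allocation Failure) [PSYoungGen:...]",
    ]
    for (kind, cause), n in tally_gc_events(sample).most_common():
        print(f"{kind:8} {cause:20} {n}")
```

To use it on a real run, pass the lines of the downloaded stdout file, for example `tally_gc_events(open(path))` with `path` pointing at your own copy of the log.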
