
Databricks job keeps failing due to lost executors.

amitkmaurya
New Contributor II

Getting the following error while saving a DataFrame partitioned by two columns.

Job aborted due to stage failure: Task 5774 in stage 33.0 failed 4 times, most recent failure: Lost task 5774.3 in stage 33.0 (TID 7736) (13.2.96.110 executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137

Please help me understand why I am getting this error and how it can be solved.

Driver + executor: 64 GB / 16 cores.
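
For context, the write that fails looks roughly like this (the column names, output format, and path are placeholders, not the actual ones):

# Sketch of the failing write: a DataFrame saved with two partition columns.
# Column names, format, and output path are placeholders.
(df.write
   .mode("overwrite")
   .partitionBy("col_a", "col_b")
   .parquet("/mnt/output/my_table"))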

2 ACCEPTED SOLUTIONS

Kaniz
Community Manager

Hi @amitkmaurya, the error message indicates that your Spark job was aborted because a task in stage 33.0 failed repeatedly after its executor was lost.

  1. Task Failure and Exit Code 137:

    • The error message mentions that Task 5774 in stage 33.0 failed 4 times, with the most recent failure being caused by an executor (executor 7) exiting due to an issue.
    • Exit code 137 means the process was killed with SIGKILL (137 = 128 + 9); on a worker node this usually means the operating system's out-of-memory killer terminated the executor for using too much memory.
    • This suggests that your Spark job is running out of memory or encountering resource-related issues.
  2. Possible Solutions:

    • Here are some steps you can take to address this issue (a small configuration sketch follows this list):
      • Memory Configuration:
        • Check the memory configuration for both the driver and the executor. Ensure that they have sufficient memory allocated.
        • Consider adjusting the memory settings based on the available resources.
      • Resource Allocation:
        • Verify that the total memory available (64GB) is distributed appropriately between the driver and the executor.
        • If possible, increase the memory allocation for the executor.
      • Task Serialization:
        • Ensure that the objects being processed in your Spark job are serializable. Non-serializable objects can cause issues.
        • If you’re using custom classes, make sure they implement the Serializable interface.
      • Dependency Issues:
        • Sometimes, dependency conflicts can lead to unexpected behavior.
        • Make sure that your application’s dependencies (such as libraries and JAR files) are compatible with the Spark version of your Databricks Runtime.
      • Check for Corrupted Data:
        • If your data source contains corrupted files, it can cause failures.
        • Verify the integrity of your data files and remove any corrupted files.
      • Logging and Debugging:
        • Enable detailed logging to identify any specific issues.
        • Check the Spark logs (both driver and executor) for additional information.
      • Cluster Health:
        • Ensure that your cluster is healthy and all worker nodes are functioning properly.
        • Monitor resource usage during job execution.
      • Upgrade Spark Version:
        • Consider upgrading to a more recent version of Spark (if possible) to benefit from bug fixes and improvements.
  3. Additional Considerations:

    • If you’re using Delta Lake, consider running FSCK REPAIR TABLE <table_name> to repair any inconsistencies between the Delta transaction log and the underlying files.
    • Review your application code to identify any potential bottlenecks or inefficiencies.
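
A minimal configuration sketch for the memory and partitioning points above, assuming a 64 GB / 16 core worker; the values are illustrative starting points, not recommendations:

# Session-level settings that influence task size (values are examples only).
spark.conf.set("spark.sql.shuffle.partitions", "800")                         # more, smaller shuffle partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))   # 128 MB input splits

# Cluster-level settings (set in the cluster's Spark config, not from a running notebook):
#   spark.executor.memory          48g   # leave headroom below the 64 GB node size
#   spark.executor.memoryOverhead  8g    # off-heap overhead; exit code 137 is often the OS OOM killer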

Start by checking memory allocation, serialization, and dependencies.

If the problem persists, delve deeper into the logs and diagnostics to pinpoint the root cause.

I hope this helps you resolve the issue! If you need further assistance, feel free to ask. 😊

 


amitkmaurya
New Contributor II

Hi, 

I have solved the problem with the same workers and driver.

In my case, data skew was the problem.

Adding a repartition on the DataFrame just before writing distributed the data evenly across the nodes, and the stage failure was resolved.
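
Roughly, the change looks like this (the partition count, column names, and path are placeholders):

# Round-robin repartition spreads rows evenly across tasks before the write,
# which helps when a few partition-column values are much larger than the rest.
(df.repartition(400)
   .write
   .mode("overwrite")
   .partitionBy("col_a", "col_b")
   .parquet("/mnt/output/my_table"))

Note that repartition(n) without column arguments shuffles rows round-robin, so no single task ends up holding a disproportionately large key; repartitioning by the partition columns themselves would instead concentrate each key on one task.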

Thanks @Kaniz for your insights.

