Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Executor heartbeat timed out

nadia
New Contributor II

Hello, I'm trying to read a table that is stored in PostgreSQL and contains 28 million rows. I get the following error:

"SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.139.64.6 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 161734 ms"

Could you help me please?

Thanks

1 ACCEPTED SOLUTION

Prabakar
Databricks Employee

This could have one of two causes: scalability or timeout.

For scalability: consider moving to a larger node type.

For timeout: set the following in the cluster's Spark config.

spark.executor.heartbeatInterval 300s

spark.network.timeout 320s
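
The Spark configuration documentation notes that spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. As a rough sanity check before pasting values into the cluster config, something like the sketch below could be used. The helper names (`to_seconds`, `heartbeat_below_timeout`) are hypothetical and not part of Spark, and the parser only handles simple suffixes such as `ms`, `s`, `m`, `min`, and `h`.

```python
import re

# Hypothetical helper (not a Spark API): seconds per supported time suffix.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "min": 60.0, "h": 3600.0}


def to_seconds(value: str) -> float:
    """Convert a simple Spark-style time string such as '300s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min|m|h)\s*", value)
    if match is None:
        raise ValueError(f"unsupported time string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def heartbeat_below_timeout(heartbeat_interval: str, network_timeout: str) -> bool:
    """Return True when the heartbeat interval is shorter than the network timeout."""
    return to_seconds(heartbeat_interval) < to_seconds(network_timeout)


# The two values suggested in this reply.
conf = {
    "spark.executor.heartbeatInterval": "300s",
    "spark.network.timeout": "320s",
}

ok = heartbeat_below_timeout(
    conf["spark.executor.heartbeatInterval"],
    conf["spark.network.timeout"],
)
print("heartbeat interval is below network timeout:", ok)
```

The values above pass this check, but only with a 20-second margin. Whether that counts as "significantly less" is a judgment call, so a wider gap may be worth trying.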

4 REPLIES 4

nlnsha
New Contributor II

I set these properties at the cluster level, but the issue is not resolved.

I am trying to read an Oracle table over JDBC and write it to Unity Catalog.

When I set a high number of partitions with .option("numPartitions", partitions)\ (for example 100 or 50) to achieve maximum parallelism, I get this heartbeat timed out error.

Cluster config: a minimum of 5 machines (20 cores, 140 GB), with autoscaling set to 10.

When I reduce numPartitions to 25, the issue does not occur and everything runs fine.

The data is a few tables, each with around this much data: 173313859

Any reasoning for this?
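
One possible explanation, offered as a hypothesis rather than a diagnosis: per the Spark JDBC data source documentation, numPartitions also sets the maximum number of concurrent JDBC connections. With 100 partitions and roughly 100 cores, up to 100 queries may hit the database at once, each buffering rows on an executor. Below is a minimal sketch of a partitioned read that pairs numPartitions with an explicit fetchsize. The option names are the documented Spark JDBC options. The URL, table, column, bounds, and the helper names `jdbc_partition_options` and `read_partitioned` are placeholders, and credentials are omitted.

```python
def jdbc_partition_options(url, table, partition_column,
                           lower_bound, upper_bound,
                           num_partitions, fetch_size=10_000):
    """Build the option map for a partitioned Spark JDBC read (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        # partitionColumn, lowerBound, upperBound and numPartitions go together.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        # Also caps the number of concurrent JDBC connections.
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; 10_000 is an arbitrary starting point.
        "fetchsize": str(fetch_size),
    }


def read_partitioned(spark, options):
    """Load the table with an existing SparkSession; not called in this sketch."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values for illustration only.
options = jdbc_partition_options(
    url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
    table="SALES.ORDERS",
    partition_column="ORDER_ID",
    lower_bound=1,
    upper_bound=173_313_859,
    num_partitions=25,
)
print(options["numPartitions"], options["fetchsize"])
```

The bounds only determine the partition stride and do not filter rows, so a skewed partition column can still leave a few tasks with most of the data. The Spark UI should show whether that is happening.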

jose_gonzalez
Databricks Employee

Hi @Boumaza nadia,

Did you check the logs for executor 3 while the cluster was active? If you get this error message again, I highly recommend checking the executor's logs to confirm the cause of the issue.

SparkJun
Databricks Employee

Please also review the Spark UI to find the failed Spark job and Spark stage. Check the GC time and the amount of data spilled to memory and disk, and look for any error on the failed task in the Spark stage view. This will help confirm whether data skew or GC/memory issues on the executors are the cause.

Then also add spark.task.cpus 2 to the Spark config, so that each task is allocated two cores.
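
To illustrate the effect of spark.task.cpus: an executor can run as many tasks at once as its cores divided by spark.task.cpus, so raising the setting to 2 halves the number of concurrent tasks and gives each task a larger share of executor memory. The arithmetic below is a sketch. It assumes one executor per worker that uses all of the worker's cores, takes the 5 workers with 20 cores each from the cluster described earlier in this thread, and uses a hypothetical function name.

```python
def concurrent_task_slots(cores_per_executor: int, num_executors: int,
                          task_cpus: int = 1) -> int:
    """Number of tasks the cluster can run at once (hypothetical helper)."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    # Each executor offers floor(cores / spark.task.cpus) task slots.
    return (cores_per_executor // task_cpus) * num_executors


# Assumed layout: 5 executors with 20 cores each.
print(concurrent_task_slots(20, 5, task_cpus=1))  # 100
print(concurrent_task_slots(20, 5, task_cpus=2))  # 50
```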
