Why do I always see "Executor heartbeat timed out" messages in the Spark driver logs?

brickster_2018
Esteemed Contributor

I often see "Executor heartbeat timed out" messages in the Spark driver logs, and sometimes the job fails with this error.

Will increasing "spark.executor.heartbeatInterval" help mitigate the issue?

1 ACCEPTED SOLUTION

brickster_2018
Esteemed Contributor

It is a common misconception that increasing "spark.executor.heartbeatInterval" will mitigate or resolve heartbeat issues. In fact, increasing spark.executor.heartbeatInterval makes the error more likely and worsens the situation.

This is because "spark.executor.heartbeatInterval" determines how often each executor sends a heartbeat to the driver. Increasing it reduces the number of heartbeats sent, and since the driver only waits a fixed window (spark.network.timeout, 120 s by default) before declaring an executor lost, fewer heartbeats mean a greater chance that the window expires without one arriving.
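To see why, here is a quick back-of-the-envelope sketch (a hypothetical helper, using the stock defaults of a 10 s heartbeat interval and a 120 s network timeout):

```python
# Roughly how many consecutive heartbeats can be lost before the
# driver's timeout window expires and it declares the executor dead.
def missed_heartbeats_tolerated(heartbeat_interval_s: int,
                                network_timeout_s: int) -> int:
    return network_timeout_s // heartbeat_interval_s

# Stock defaults: 10 s heartbeats, 120 s timeout -> ~12 misses tolerated.
print(missed_heartbeats_tolerated(10, 120))  # 12
# Raising the interval to 60 s leaves room for only 2 misses, so a single
# GC pause or network blip is far more likely to be fatal.
print(missed_heartbeats_tolerated(60, 120))  # 2
```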

To mitigate the issue, increase "spark.network.timeout" instead, for example to 300 s. Setting a very high value for spark.network.timeout is not recommended, as that would delay the detection of genuine failures.
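A minimal sketch of applying this when building a SparkSession (on Databricks these properties are typically set in the cluster's Spark config instead; the 300 s value is illustrative, not a universal recommendation):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Keep the executor heartbeat interval at its 10s default.
    .config("spark.executor.heartbeatInterval", "10s")
    # Widen the window the driver waits before declaring an executor lost
    # (default 120s). Very high values delay detection of real failures.
    .config("spark.network.timeout", "300s")
    .getOrCreate()
)
```

Note that spark.executor.heartbeatInterval must stay well below spark.network.timeout; Spark itself rejects configurations where the heartbeat interval is not smaller than the timeout.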

