Databricks Community

nadia · ‎06-12-2022

Hello, I'm trying to read a table that is stored in PostgreSQL and contains 28 million rows. I get the following error:

"SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.139.64.6 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 161734 ms"

Could you help me please?

Thanks

Prabakar · ‎07-05-2022

This could be for one of two reasons: scalability or a timeout.

For scalability: consider using a larger node type.

For timeout: set the following in the cluster Spark config.

spark.executor.heartbeatInterval 300s

spark.network.timeout 320s
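As a quick sanity check on the two values above: the Spark configuration docs recommend keeping spark.executor.heartbeatInterval well below spark.network.timeout (the defaults are 10s and 120s). The sketch below only checks that ordering for a pair of settings. The helper names to_seconds and heartbeat_below_timeout are made up for illustration, and the parser handles only a subset of Spark's duration suffixes.

```python
import re

# Subset of Spark's duration suffixes, mapped to seconds (assumption: enough for this check).
_UNIT_SECONDS = {"s": 1, "m": 60, "min": 60, "h": 3600, "d": 86400}


def to_seconds(value: str) -> int:
    """Convert a Spark-style duration such as '300s' or '5min' to whole seconds."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip().lower())
    if match is None or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def heartbeat_below_timeout(conf: dict) -> bool:
    """Return True when the heartbeat interval is shorter than the network timeout."""
    heartbeat = to_seconds(conf["spark.executor.heartbeatInterval"])
    timeout = to_seconds(conf["spark.network.timeout"])
    return heartbeat < timeout


# The values suggested in this reply.
suggested = {
    "spark.executor.heartbeatInterval": "300s",
    "spark.network.timeout": "320s",
}
print(heartbeat_below_timeout(suggested))
```

The suggested pair passes the check, but only with a 20 second margin, so it may be worth leaving more headroom between the two values.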

nlnsha · ‎10-17-2024

I set these properties at the cluster level, but the issue is not resolved.

I am trying to read an Oracle table over JDBC and write it to Unity Catalog.

When I set a high value for .option("numPartitions", partitions), such as 100 or 50, to achieve maximum parallelism, I get this heartbeat timed out error.

Cluster config: my cluster has a minimum of 5 machines (20 cores, 140 GB), with auto-scaling set to 10.

But when I reduce numPartitions to 25, the issue does not occur and everything runs fine.

The data is a few tables of around this size: 173313859.

Any reasoning for this?
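For reference, a partitioned JDBC read of this kind could be set up roughly as in the sketch below. The option names (partitionColumn, lowerBound, upperBound, numPartitions, fetchsize) are standard Spark JDBC options. The URL, table name, column name, and bounds are placeholders and not values from the actual job. Per the Spark JDBC docs, numPartitions also sets the maximum number of concurrent JDBC connections, so 100 partitions can mean up to 100 simultaneous connections to the database.

```python
def jdbc_read_options(url, table, partition_column, lower_bound, upper_bound,
                      num_partitions, fetch_size=1000):
    """Build the option map for a partitioned Spark JDBC read."""
    return {
        "url": url,
        "dbtable": table,
        # Numeric, date, or timestamp column used to split the read into ranges.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        # Upper limit on partitions and on concurrent JDBC connections.
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip to the database.
        "fetchsize": str(fetch_size),
    }


def read_table(spark, options):
    """Run the read with an existing SparkSession, for example in a notebook."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values for illustration only.
options = jdbc_read_options(
    url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
    table="SALES.ORDERS",
    partition_column="ORDER_ID",
    lower_bound=1,
    upper_bound=173313859,
    num_partitions=25,
)
```

In a notebook this would be called as df = read_table(spark, options). Credentials are left out of the sketch and would need to be supplied as well.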

jose_gonzalez · ‎07-07-2022

Hi @Boumaza nadia,

Did you check the executor 3 logs while the cluster was active? If you get this error message again, I highly recommend checking the executor's logs to confirm the cause of the issue.

SparkJun · ‎06-18-2024

Please also review the Spark UI to see the failed Spark job and Spark stage. Please check on the GC time and data spill to memory and disk. See if there is any error in the failed task in the Spark stage view. This will confirm data skew or GC/memory issues with the executors.

Then add spark.task.cpus 2 to the Spark config so that each task is allocated two cores.
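To see what spark.task.cpus changes: the number of tasks an executor runs at once is its core count divided by spark.task.cpus, so raising it to 2 halves the concurrency and leaves each task a larger share of executor memory. A small sketch of that arithmetic follows. The function name task_slots is made up, and the 20-core figure is only an illustration that assumes one executor uses all cores of a 20-core worker.

```python
def task_slots(executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks one executor can run concurrently."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("task_cpus must be between 1 and executor_cores")
    return executor_cores // task_cpus


# Assumed 20-core executor: default setting versus spark.task.cpus 2.
print(task_slots(20, 1))
print(task_slots(20, 2))
```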


Executor heartbeat timed out
