topic Re: Executor heartbeat timed out in Data Engineering

Executor heartbeat timed out

nadia — Sun, 12 Jun 2022 21:19:33 GMT

Hello, I'm trying to read a table that is stored in PostgreSQL and contains 28 million rows. I get the following error:

"SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.139.64.6 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 161734 ms"

Could you help me please?

Thanks

Re: Executor heartbeat timed out

Prabakar — Tue, 05 Jul 2022 14:50:28 GMT

This could be for one of two reasons: scalability or a timeout.

For scalability, consider moving to a larger node type.

For the timeout, you can set the following in the cluster Spark config:

spark.executor.heartbeatInterval 300s

spark.network.timeout 320s
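One caveat worth checking: the Spark configuration documentation says spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. Below is a minimal Python sketch that compares a pair of values. The helper names and the "at most half" threshold are assumptions made for illustration, not a rule defined by Spark, and only a subset of Spark's time suffixes is parsed.

```python
import re

# Seconds per unit, for a subset of the time suffixes Spark accepts.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "min": 60, "h": 3600}


def to_seconds(value: str) -> float:
    """Convert a time string such as '300s' or '5min' to seconds."""
    match = re.fullmatch(r"(\d+)\s*(ms|s|min|m|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]


def heartbeat_margin_ok(heartbeat: str, network_timeout: str,
                        max_ratio: float = 0.5) -> bool:
    """Return True if the heartbeat interval is at most max_ratio of the timeout.

    max_ratio is an illustrative threshold (an assumption), standing in for
    the documentation's "significantly less".
    """
    return to_seconds(heartbeat) <= max_ratio * to_seconds(network_timeout)


print(heartbeat_margin_ok("300s", "320s"))  # values quoted above -> False
print(heartbeat_margin_ok("10s", "120s"))   # Spark's documented defaults -> True
```

With the values quoted above, the interval is only 20 seconds shorter than the timeout, so a smaller heartbeat interval or a larger network timeout may be worth trying.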

Re: Executor heartbeat timed out

jose_gonzalez — Fri, 08 Jul 2022 00:26:14 GMT

Hi @Boumaza nadia,

Did you check the executor 3 logs while the cluster was active? If you get this error message again, I highly recommend checking the executor's logs to confirm what caused the issue.

Re: Executor heartbeat timed out

SparkJun — Tue, 18 Jun 2024 20:52:44 GMT

Please also review the Spark UI to find the failed Spark job and stage. Check the GC time and any data spill to memory and disk, and see whether the failed task shows an error in the Spark stage view. This will help confirm whether data skew or GC/memory issues on the executors are the cause.

Then, also add spark.task.cpus 2 to the Spark config to allocate two cores to each task.
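To make the effect of spark.task.cpus concrete: an executor runs at most spark.executor.cores divided by spark.task.cpus tasks at once, so raising it to 2 halves the number of tasks sharing each executor's memory. A small sketch of that arithmetic follows; the function name and the 8-core executor are hypothetical and not taken from this thread.

```python
def task_slots_per_executor(executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks one executor can run concurrently."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("executor_cores and task_cpus must be positive")
    return executor_cores // task_cpus


# Hypothetical executor with 8 cores.
print(task_slots_per_executor(8, task_cpus=1))  # 8 concurrent tasks
print(task_slots_per_executor(8, task_cpus=2))  # 4 concurrent tasks
```

Fewer concurrent tasks means less parallelism, so this trades throughput for more memory headroom per task.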

Re: Executor heartbeat timed out

nlnsha — Fri, 18 Oct 2024 06:08:35 GMT

I set these properties at the cluster level, but the issue is not resolved.

I am trying to read an Oracle table over JDBC and write it to Unity Catalog.

When I set a high number of partitions, such as .option("numPartitions", partitions) with 100 or 50, to achieve maximum parallelism, I get this heartbeat timed out issue.

Cluster conf: I have a minimum of 5 machines (20 cores, 140 GB) on my cluster, with auto-scaling set to 10.

But when I reduce numPartitions to 25, the issue does not occur and everything runs fine.

The data is a few tables of roughly this size: 173313859.

Any reasoning for this?
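For context on the options involved: in Spark's JDBC data source, numPartitions also caps the number of concurrent JDBC connections, and for reads it only splits the table when partitionColumn, lowerBound and upperBound are supplied together with it. Whether this explains the timeout here is not confirmed; checking the executor logs and the Spark UI as suggested above would be needed. The sketch below shows how the options fit together. The URL, table name, column name, bounds and fetch size are hypothetical placeholders, the helper names are made up for illustration, and credentials are omitted.

```python
from typing import Dict


def jdbc_read_options(url: str, table: str, partition_column: str,
                      lower_bound: int, upper_bound: int,
                      num_partitions: int,
                      fetch_size: int = 10000) -> Dict[str, str]:
    """Build the option map for a partitioned Spark JDBC read."""
    if upper_bound <= lower_bound:
        raise ValueError("upper_bound must be greater than lower_bound")
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")
    return {
        "url": url,
        "dbtable": table,
        # These four options are used together to split the read.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip; the value here is an assumption.
        "fetchsize": str(fetch_size),
    }


def read_table(spark, options: Dict[str, str]):
    """Load the table using an existing SparkSession passed in as `spark`."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical values for illustration only.
options = jdbc_read_options(
    url="jdbc:oracle:thin:@//db-host:1521/SERVICE",
    table="SCHEMA.BIG_TABLE",
    partition_column="ID",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=25,
)
print(options["numPartitions"])
```

In a notebook, read_table(spark, options) would return the DataFrame. Keeping numPartitions as a single parameter makes it easy to compare runs at 25, 50 and 100 while watching GC time and spill in the Spark UI.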