Troubleshooting Cluster

AndySkinner — Wed, 18 Sep 2024 18:24:10 GMT

A previously working fact table load (our biggest one) failed, and it looked like an executor was failing due to a timeout error. As a test we increased the cluster size and changed spark.executor.heartbeatInterval to 300s and spark.network.timeout to 600s. However, the job still fails, reporting "Executor heartbeat timed out after XXXXX ms". Looking further in the logs, we found an error message with the code XXKDA and not much other info. For that code, the Databricks website suggests a bug report is in order, although we're not sure about that.

Has anyone got anything else we could check, or any idea what the XXKDA error code could mean? We're currently trying a remedial action of reducing the amount of data (it's a large merge on a fact table).

Re: Troubleshooting Cluster

Ismael-K — Tue, 04 Feb 2025 00:35:46 GMT

The XXKDA error code is a general indicator for task scheduler issues or SPARK_JOB_CANCELLED.
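
On the timeout settings mentioned in the original post: the Spark configuration docs say spark.executor.heartbeatInterval should be significantly less than spark.network.timeout, so it may be worth confirming that the values actually in effect on the cluster keep that relationship. Below is a minimal Python sketch of such a check. The helper names (to_ms, heartbeat_is_safe) are made up for illustration, the unit table covers only a few common suffixes, and the defaults (10s and 120s, with bare numbers read as milliseconds and seconds respectively) are taken from the open-source Spark docs and are assumed, not verified, to match your runtime.

```python
import re

# Subset of Spark time suffixes, in milliseconds (assumption: not exhaustive).
_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000, "min": 60_000, "h": 3_600_000}


def to_ms(value: str, default_unit: str) -> int:
    """Convert a Spark-style time string such as '300s' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if not match:
        raise ValueError(f"Unrecognised time value: {value!r}")
    number, unit = match.groups()
    unit = unit or default_unit  # a bare number falls back to the default unit
    if unit not in _UNITS_MS:
        raise ValueError(f"Unsupported time unit: {unit!r}")
    return int(number) * _UNITS_MS[unit]


def heartbeat_is_safe(conf: dict) -> bool:
    """Return True if the heartbeat interval is below the network timeout.

    This only checks 'strictly less than'; the Spark docs ask for
    'significantly less', so treat a True result as a minimum bar.
    """
    heartbeat = to_ms(conf.get("spark.executor.heartbeatInterval", "10s"), "ms")
    network = to_ms(conf.get("spark.network.timeout", "120s"), "s")
    return heartbeat < network


# Values from the original post: 300s heartbeat, 600s network timeout.
posted = {
    "spark.executor.heartbeatInterval": "300s",
    "spark.network.timeout": "600s",
}
print(heartbeat_is_safe(posted))
```

With the values from the post the check passes, so the settings are at least consistent with each other. If only the heartbeat interval had been raised to 300s and the network timeout left at its default, the check would fail.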
