We had a failure on a fact table load (our biggest one) that had previously been running fine, and it looked like an executor was failing with a timeout error. As a test we upped the cluster size and set spark.executor.heartbeatInterval to 300s and spark.network.timeout to 600s. However, the job still fails, reporting "Executor heartbeat timed out after XXXXX ms". Looking further in the logs we noted an error message with the code XXKDA and not much other information. The Databricks website suggests that error warrants a bug report, although we're not sure about that.
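For reference, this is roughly how we applied the two timeout settings (the job script name is a placeholder; the values are the ones from our test). One thing we did check: the heartbeat interval has to stay well below the network timeout, which 300s vs 600s satisfies.

```shell
# Sketch of the config changes we tried (not a fix, just what we ran).
# spark.executor.heartbeatInterval must be significantly less than
# spark.network.timeout, otherwise Spark rejects the configuration.
spark-submit \
  --conf spark.executor.heartbeatInterval=300s \
  --conf spark.network.timeout=600s \
  our_fact_table_load.py   # placeholder for the actual job
```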
Has anyone got anything else we could check, or any idea what the XXKDA error code might mean? We're currently trying a remedial action of reducing the amount of data (it's a large merge on a fact table).