I have a Databricks job that runs daily at 14:00 IST and typically finishes in about 2 hours. Yesterday, however, the job got stuck and kept running. After it had been running for more than 5 hours, I canceled it and reran it, and the rerun completed successfully in 2 hours. I'm not sure why the first run never finished; the cluster's compute metrics appeared normal. Reviewing the cluster log, I found the following in standard error:
appcds_setup elapsed time: 0.000
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
chown: invalid group: ':spark-users'
Mon Sep 9 08:39:27 2024 Connection to spark from PID 1323
Mon Sep 9 08:39:27 2024 Initialized gateway on port 38871
Mon Sep 9 08:39:29 2024 Connected to spark.
Mon Sep 9 08:41:19 2024 Connection to spark from PID 1809
Mon Sep 9 08:41:19 2024 Initialized gateway on port 44921
Mon Sep 9 08:41:19 2024 Connection to spark from PID 1826
Mon Sep 9 08:41:19 2024 Initialized gateway on port 35631
Mon Sep 9 08:41:22 2024 Connected to spark.
Mon Sep 9 08:41:22 2024 Connected to spark.
Mon Sep 9 08:41:28 2024 Connection to spark from PID 1881
Mon Sep 9 08:41:28 2024 Initialized gateway on port 33377
Mon Sep 9 08:41:32 2024 Connected to spark.
Mon Sep 9 08:41:46 2024 Connection to spark from PID 1971
Mon Sep 9 08:41:46 2024 Initialized gateway on port 44409
Mon Sep 9 08:41:50 2024 Connected to spark.
Mon Sep 9 08:42:01 2024 Connection to spark from PID 2030
Mon Sep 9 08:42:01 2024 Initialized gateway on port 39285
Mon Sep 9 08:42:05 2024 Connected to spark.
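In case it helps when reading the log: the timestamps line up with the 14:00 IST schedule only if they are in UTC. That is an assumption on my part (I have not confirmed the cluster's time zone), and the helper name below is just for illustration. A minimal sketch of the conversion for the first and last lines of the excerpt:

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset (no daylight saving time).
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

def log_time_to_ist(stamp: str) -> datetime:
    """Parse a stderr timestamp and convert it to IST.

    ASSUMPTION: the log timestamps are in UTC; the log itself does not say so.
    """
    utc_time = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(IST)

first = log_time_to_ist("Mon Sep 9 08:39:27 2024")
last = log_time_to_ist("Mon Sep 9 08:42:05 2024")

print(first.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2024-09-09 14:09:27 IST
print(last.strftime("%Y-%m-%d %H:%M:%S %Z"))   # 2024-09-09 14:12:05 IST
```

Under that assumption, this excerpt covers only about 14:09 to 14:12 IST, so it shows the start of the run and nothing from the hours during which the job was stuck.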
Could someone please help me identify what might have caused the first run to hang?