Hello,
I have a job cluster running a streaming job, and on 19th March it unexpectedly failed with DRIVER_UNAVAILABLE (Request timed out, Driver is temporarily unavailable) in the event log. This is the run: https://atlassian-discover.cloud.databricks.com/jobs/323849284041517/runs/395169892801478?o=44820012...
I'm aware of a KB article describing the same problem: https://kb.databricks.com/en_US/jobs/driver-unavailable, which points out that memory pressure is a common cause. However, according to the driver stdout, there were only minor GCs taking around 30-40 ms around the time the driver became unavailable:
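Since the KB article points at memory pressure, one thing I'm considering is adding heap-dump-on-OOM flags to the driver JVM through the cluster's Spark config, so there is concrete evidence if it really is a memory problem next time. This is only a sketch; the dump path is an example and I'm not sure how it interacts with the JVM options the runtime already sets:

```
spark.driver.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-heap-dumps
```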
I also checked the driver logs (log4j) and they don't contain any error messages; the few warning messages there are unrelated. In fact, the driver continued outputting logs for several minutes after the DRIVER_UNAVAILABLE message appeared in the event log.
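In the meantime I'm thinking of attaching a StreamingQueryListener so that micro-batch progress is logged explicitly, which should make it easier to tell whether the query itself stalled before the driver was reported unavailable. A rough sketch, assuming a runtime with PySpark 3.4+ where the Python listener API exists (the log format is just an example):

```python
from pyspark.sql.streaming import StreamingQueryListener

class ProgressLogger(StreamingQueryListener):
    """Log a heartbeat per micro-batch so gaps can be correlated with cluster events."""

    def onQueryStarted(self, event):
        print(f"[stream] started id={event.id} runId={event.runId}")

    def onQueryProgress(self, event):
        p = event.progress
        print(f"[stream] batch={p.batchId} ts={p.timestamp} inputRows={p.numInputRows}")

    def onQueryTerminated(self, event):
        # event.exception is None on a clean stop
        print(f"[stream] terminated id={event.id} exception={event.exception}")

spark.streams.addListener(ProgressLogger())
```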
I also tried loading the Spark UI, but after a long wait with messages about processing files, it fails with the error below, so I can't see the Spark history UI either:
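As a workaround for the broken UI, I assume I could at least browse the raw delivered logs if cluster log delivery is enabled for the job cluster, with something like this from a notebook (the destination path and cluster ID are placeholders, and I'm going from memory on the folder layout):

```python
# Assumes cluster log delivery is configured to dbfs:/cluster-logs (example destination)
log_root = "dbfs:/cluster-logs/<cluster-id>"

# Driver stdout / stderr / log4j files delivered by the cluster
display(dbutils.fs.ls(f"{log_root}/driver"))

# Spark event logs, which the history UI is built from
display(dbutils.fs.ls(f"{log_root}/eventlog"))
```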
Could anyone help please?