Hi,
This happened across four seemingly unrelated workflows at the same time of day, and all four eventually completed successfully. All of them appeared to sit idle despite being triggered via the Jobs API. I observed two symptoms: for three of the workflows, Databricks only initiated cluster creation 3 hours(!) after the request was issued via the Jobs API; for the fourth, a cluster was created promptly but then idled on a task/notebook for 3 hours, even though none of its individual cells reported a duration longer than a few seconds.
In a nutshell, the workflows weren't delayed in processing, reading, writing, or doing any other work. They all looked like they suddenly sat idle for 3 hours at the same time.
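For context, this is roughly how I've been checking where the time went for one of the affected runs: pulling the run details from the Jobs API and comparing the setup duration (cluster provisioning) against the execution duration. The host, token, and run_id below are placeholders for our workspace values, so treat it as a sketch rather than exactly what I ran.

```python
# Sketch: inspect where the 3 hours were spent for a given job run.
# DATABRICKS_HOST / DATABRICKS_TOKEN / run_id are placeholders.
import requests

DATABRICKS_HOST = "https://adb-xxxx.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "dapi..."                              # placeholder
run_id = 123456789                                        # one of the affected runs

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    params={"run_id": run_id},
)
run = resp.json()

# Timestamps are epoch milliseconds; durations are in milliseconds.
print("start_time:", run.get("start_time"))
print("setup_duration (cluster creation etc.):", run.get("setup_duration"))
print("execution_duration:", run.get("execution_duration"))
print("cleanup_duration:", run.get("cleanup_duration"))

# Per-task breakdown, to see which task the idle time is attributed to.
for task in run.get("tasks", []):
    print(task.get("task_key"), task.get("setup_duration"), task.get("execution_duration"))
```

In my case the durations don't obviously account for the gap, which is what makes it look like pure idling rather than slow work.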
The driver logs and log4j output don't reveal anything. We are on Azure using Spot instances, so I wondered whether it could have anything to do with eviction; however, nothing in the logs suggests that was happening. Could it simply be Azure being slow to provision compute?
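For the eviction/capacity theory, the other place I was planning to look is the cluster event log, which should show NODES_LOST events if spot nodes were reclaimed, or a long gap between the CREATING and RUNNING events if Azure was slow handing out capacity. A rough sketch (host/token as above, and the cluster_id is a placeholder for the job cluster of one of the affected runs):

```python
# Sketch: pull the cluster event log to look for spot eviction (NODES_LOST)
# or a long CREATING -> RUNNING gap pointing at slow Azure capacity.
import requests

DATABRICKS_HOST = "https://adb-xxxx.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "dapi..."                              # placeholder

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={"cluster_id": "0123-456789-abcdefgh", "limit": 100},  # placeholder cluster_id
)
for event in resp.json().get("events", []):
    print(event.get("timestamp"), event.get("type"))
```

So far I haven't spotted anything there either, but I may not be looking at the right cluster for the runs that sat queued.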
Before I dive deeper and get lost in the rabbit hole, I wanted to poll the community first.
Thanks
Tim.