It's hard to know for sure, but it may be related to the use of spot instances, since the failures appear to be random. Spot instances can be terminated at any time by the cloud provider if it needs the capacity back. That said, Databricks should handle this correctly, either by replacing lost spot workers or by applying resilience policies that prevent this type of error.
So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be somewhat higher. In any case, don't use spot instances in PROD unless your workloads can tolerate interruptions.
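
If you manage the cluster through a JSON spec (Clusters API or a job cluster definition), here is a rough sketch of what "disabling spot" could look like. I'm assuming the `availability` field under `aws_attributes` / `azure_attributes` / `gcp_attributes` as I remember it from the Clusters API docs, so please double-check the names and values for your cloud. The helper `force_on_demand` and the example spec are made up for illustration; they are not taken from your setup.

```python
import copy

# On-demand availability value per cloud, as I understand the Clusters API.
# Verify these against the docs for your workspace before relying on them.
ON_DEMAND_VALUES = {
    "aws_attributes": "ON_DEMAND",
    "azure_attributes": "ON_DEMAND_AZURE",
    "gcp_attributes": "ON_DEMAND_GCP",
}


def force_on_demand(cluster_spec):
    """Return a copy of the spec with spot/preemptible availability turned off.

    Hypothetical helper: it only rewrites the `availability` field of
    whichever cloud attribute block is already present in the spec.
    """
    spec = copy.deepcopy(cluster_spec)  # leave the caller's dict untouched
    for attr_block, on_demand_value in ON_DEMAND_VALUES.items():
        if attr_block in spec:
            spec[attr_block]["availability"] = on_demand_value
    return spec


# Placeholder spec, assuming an AWS workspace; all values are examples only.
example_spec = {
    "num_workers": 4,
    "node_type_id": "i3.xlarge",
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",
        "first_on_demand": 1,
    },
}

on_demand_spec = force_on_demand(example_spec)
print(on_demand_spec["aws_attributes"]["availability"])
```

You would then submit the modified spec the same way you submit the current one (UI JSON editor, API, Terraform, etc.). If the random failures stop while running on-demand, that is a good hint that spot terminations were the cause.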