Databricks Community

alonisser · ‎04-10-2025

In the last few days I've been hitting this message in Azure (and before that in AWS, in a slightly different form) when a cluster fails to start:

"run failed with error message Cluster '0410-173007-1pjmdgi1' was terminated. Reason: INVALID_ARGUMENT (CLIENT_ERROR). Parameters: databricks_error_message:Allocation failed. VM(s) with the following constraints cannot be allocated, because the condition is too restrictive. Please remove some constraints and try again"

This can keep recurring for 12 hours while jobs try to start, but when I changed the workers from "optional spots" to fully "on demand" it worked as expected. What I don't get is that the configuration is supposed to mean "fall back to on demand if spots aren't available", so why would it stop the job from starting at all when there isn't a spot available?

Note: we're talking about the low-end T4 GPU, if that makes a difference.

(In Azure it's mostly old DBR versions, 10.4 and 11.3, but in AWS it also happens on LTS 15. All of them are ML versions.)

Am I doing something wrong here? Is there a way to mitigate this?

Louis_Frolio · ‎04-10-2025

The issue you're encountering with cluster startup failures in Databricks, particularly when using a mix of spot and on-demand instances, is likely due to constraints in the cloud provider's capacity or configuration mismatches. Here's a detailed explanation and potential mitigation strategies:

---

Why the Cluster Fails Despite Spot Fallback
1. Spot Instance Constraints:
- Spot instances are subject to availability in the cloud provider's pool. If there isn't sufficient capacity for the instance type you requested (e.g., T4 GPUs), the cluster cannot acquire these resources. Even with fallback enabled, if the fallback mechanism encounters restrictive constraints (e.g., specific VM types or zones), it may fail entirely.

2. Fallback Mechanism Limitations:
- Databricks' "fallback to on-demand" feature is designed to replace unavailable spot instances with on-demand instances. However, this requires that on-demand capacity is available and meets all specified constraints (e.g., instance type, region). If these constraints are too restrictive, the fallback mechanism may not succeed.

3. Driver Dependency:
- If the driver node is configured as a spot instance and gets evicted or fails to launch, the entire cluster will terminate. Databricks recommends always using an on-demand instance for the driver to ensure cluster stability.

4. Older Databricks Runtime Versions:
- Older runtime versions (like 10.4 or 11.3) might lack optimizations for handling mixed-instance configurations effectively. Upgrading to a newer Long-Term Support (LTS) version could improve reliability.

---

Mitigation Strategies
Cluster Configuration Adjustments
1. Use On-Demand for Driver and Critical Workers:
- Ensure that the driver node is always on an on-demand instance. For worker nodes, use a mix of spot and on-demand instances based on workload criticality.

2. Relax Constraints:
- Review and loosen any restrictive constraints in your cluster configuration (e.g., specific VM types or regions). This increases the likelihood of resource allocation during fallback.

3. Enable Autoscaling:
- Configure autoscaling clusters to dynamically adjust resources based on workload demand and availability, reducing reliance on fixed configurations.
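
As a rough sketch of the first point, the snippet below builds a cluster spec in the shape used by the Databricks Clusters API on Azure, with the driver on an on-demand VM and the workers on spot with fallback. The `azure_attributes` field names are the ones the API documents; the function names, the node type, and the runtime string are illustrative assumptions, not values from this thread. The second helper mirrors the workaround described in the question (switching everything to on-demand).

```python
import copy
import json


def build_azure_cluster_spec(node_type_id, num_workers, spark_version):
    """Return a cluster spec dict: on-demand driver, spot workers with fallback.

    Hypothetical helper; only the dict keys come from the Clusters API.
    """
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "azure_attributes": {
            # The first N nodes (driver first) are placed on on-demand VMs.
            "first_on_demand": 1,
            # Remaining nodes request spot and fall back to on-demand.
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            # -1 means "never evict on price"; cap is the on-demand price.
            "spot_bid_max_price": -1,
        },
    }


def to_fully_on_demand(spec):
    """Return a copy of the spec with every node on on-demand VMs."""
    on_demand = copy.deepcopy(spec)
    on_demand["azure_attributes"] = {"availability": "ON_DEMAND_AZURE"}
    return on_demand


# Example values are assumptions (a T4 VM size and a GPU ML runtime key).
spec = build_azure_cluster_spec("Standard_NC4as_T4_v3", 2, "15.4.x-gpu-ml-scala2.12")
print(json.dumps(spec, indent=2))
print(json.dumps(to_fully_on_demand(spec), indent=2))
```

Keeping the on-demand variant as a derived copy makes it easy to submit the same job with either spec when spot allocation keeps failing at startup.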

Fallback Optimization
1. Spread Across Multiple Availability Zones:
- Configure your clusters to use multiple availability zones or regions to increase the chances of finding spot capacity.

2. Instance Type Flexibility:
- Use fleets or allow multiple instance types in your configuration to improve allocation success rates for both spot and fallback nodes. This can reduce errors caused by spot instance interruptions but does not guarantee complete data migration.
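
For the AWS side, the two ideas above might look roughly like the sketch below. The `aws_attributes` keys (`first_on_demand`, `availability`, `zone_id`, `spot_bid_price_percent`) are documented Clusters API fields, and `"zone_id": "auto"` asks Databricks to pick an availability zone. The helper names and the list of candidate instance types are assumptions made for illustration; trying several instance types in order is a manual stand-in for instance type flexibility, not a built-in Databricks feature.

```python
def build_aws_cluster_spec(node_type_id, num_workers, spark_version):
    """Return a cluster spec dict for AWS with spot fallback and automatic AZ.

    Hypothetical helper; only the dict keys come from the Clusters API.
    """
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "aws_attributes": {
            # Driver (the first node) stays on an on-demand instance.
            "first_on_demand": 1,
            # Workers request spot and fall back to on-demand.
            "availability": "SPOT_WITH_FALLBACK",
            # Let Databricks choose the availability zone.
            "zone_id": "auto",
            # Bid up to 100% of the on-demand price.
            "spot_bid_price_percent": 100,
        },
    }


def candidate_specs(node_type_ids, num_workers, spark_version):
    """Build one spec per acceptable instance type, in order of preference."""
    return [
        build_aws_cluster_spec(node_type, num_workers, spark_version)
        for node_type in node_type_ids
    ]


# Example values are assumptions: two T4-based instance sizes.
specs = candidate_specs(["g4dn.xlarge", "g4dn.2xlarge"], 2, "15.4.x-gpu-ml-scala2.12")
for s in specs:
    print(s["node_type_id"], s["aws_attributes"]["zone_id"])
```

A job launcher could walk this list and submit the next spec whenever the previous one terminates with an allocation error.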

Hope this helps. Louis.

alonisser · ‎04-10-2025

Thanks @Louis_Frolio,

To be precise

1. It's not constrained in general, only for spots: removing the spots while keeping the same instance type works.

2. It's a very small cluster, so there's no point in autoscaling. If it can't get 2 workers, it won't get more.

3. The driver is already on demand.

4. Generally I think it's a bug with strange behavior. My guess is that it would work if the job were already running and an instance got evicted, but something isn't working correctly when spots aren't available at cluster start.

5. I did see this behavior in AWS with 15.4, so I don't think it's a DBR version issue. Again, I think it's a core bug in the behavior. But if you can point me to where this was fixed I'd be glad; maybe I'm misinterpreting what I saw with 15.4.

I'll read about fleets and multiple zones; maybe that can help.