Re: How can I manage the code on using a Spot Inst...

Louis_Frolio · ‎01-28-2026

Hey @jeremy98, here are some suggestions, but please test them in a safe environment, as I cannot guarantee the desired outcome.

You can absolutely run the driver on on-demand and the workers on spot in Databricks. There are two clean ways to do it, depending on whether you want to use instance pools.

Option A is the simplest: no pools. You let the cluster manage placement by setting availability to SPOT_WITH_FALLBACK and first_on_demand = 1. Because first_on_demand counts nodes starting with the driver, a value of 1 guarantees the driver comes up on on-demand, while executors prefer spot and fall back to on-demand if spot capacity isn’t there.

resources:
  clusters:
    etl_cluster_spot_fallback:
      cluster_name: "etl-cluster-spot-fallback"
      # Runtime alias; if it is rejected outside a cluster policy, pin a concrete version instead.
      spark_version: "auto:latest-lts"
      node_type_id: "i3.xlarge"
      autotermination_minutes: 60

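      aws_attributes:
        availability: "SPOT_WITH_FALLBACK"  # prefer spot, fall back to on-demand if spot is unavailable
        first_on_demand: 1                  # the first node (the driver) is always on-demand
        spot_bid_price_percent: 100         # max spot price as a percentage of the on-demand price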

      autoscale:
        min_workers: 2
        max_workers: 8

      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"

What’s happening here is straightforward: the driver is pinned to on-demand, workers prefer spot, and Databricks handles fallback automatically if spot capacity dries up. For many teams, this is the best balance of simplicity and resilience.
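If losing most executors at once would hurt, the same setting can keep a few workers on on-demand as well. A rough sketch, where the value 3 is only an illustrative assumption and not a recommendation: since first_on_demand counts nodes from the driver onward, 3 means the driver plus the first two workers stay on on-demand, and any further workers prefer spot.

```yaml
      # Drop-in variant of the aws_attributes block above.
      aws_attributes:
        availability: "SPOT_WITH_FALLBACK"
        first_on_demand: 3            # illustrative: driver + first 2 workers on on-demand
        spot_bid_price_percent: 100
```

With min_workers: 2 as in the example, the cluster at its minimum size would run entirely on on-demand, and only the workers added by autoscaling would land on spot.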

Option B uses pools, which gives you tighter control and faster startup, but requires a bit more plumbing. Because pools are either all spot or all on-demand, mixing driver and workers means two pools.

resources:
  instance_pools:
    driver_on_demand_pool:
      instance_pool_name: "driver-on-demand-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 10
      aws_attributes:
        availability: "ON_DEMAND"

    workers_spot_pool:
      instance_pool_name: "workers-spot-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 50
      aws_attributes:
        availability: "SPOT"
        spot_bid_price_percent: 100

  clusters:
    etl_cluster_pools_hybrid:
      cluster_name: "etl-cluster-pools-hybrid"
      # Runtime alias; if it is rejected outside a cluster policy, pin a concrete version instead.
      spark_version: "auto:latest-lts"
      autotermination_minutes: 60

      # Driver comes from the on-demand pool, workers from the spot pool.
      driver_instance_pool_id: ${resources.instance_pools.driver_on_demand_pool.id}
      instance_pool_id:        ${resources.instance_pools.workers_spot_pool.id}

      autoscale:
        min_workers: 2
        max_workers: 8

      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"

A couple of important nuances here. Pools are all-or-nothing for spot vs on-demand, so separate pools are required. Also, on pool-backed clusters, availability is governed by the pool itself rather than by the cluster's own aws_attributes. That is why fallback is much easier to reason about when you’re not using pools.
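On the faster-startup point: with min_idle_instances: 0, the pool only holds instances that an earlier cluster has released, so a cold pool offers little startup benefit. Here is a hedged sketch of a warmer worker pool, following the same layout as the pool definitions above. The specific numbers and the runtime version are illustrative assumptions, so adjust them to your workload.

```yaml
resources:
  instance_pools:
    workers_spot_pool:
      instance_pool_name: "workers-spot-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 2                       # assumption: keep two warm instances ready
      idle_instance_autotermination_minutes: 30   # assumption: release extra idle instances after 30 min
      max_capacity: 50
      preloaded_spark_versions:
        - "15.4.x-scala2.12"                      # assumption: should match the cluster's spark_version
      aws_attributes:
        availability: "SPOT"
        spot_bid_price_percent: 100
```

Keep in mind that idle instances still incur cloud-provider cost while they wait, so a non-zero min_idle_instances trades some of the spot savings for startup time.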

Now, when should you actually use spot?

There’s no magic “X minutes” threshold. It really comes down to interruption tolerance.

Spot works well for retryable, checkpointed batch ETL, ML training jobs that can resume, and anything where a restart is annoying but not catastrophic. Pairing spot workers with an on-demand driver and fallback gives you a very solid cost/performance tradeoff.

Be cautious with always-on streaming, tight SLAs, or capacity-constrained regions. In those cases, eviction risk can outweigh the savings. If you must use spot, keep the driver on on-demand and be explicit about your retry and checkpoint strategy.
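To make the retry side of that concrete, here is a minimal sketch of a bundle job whose task is retried automatically if a spot eviction causes it to fail. The job name, task key, notebook path, runtime version, worker count, and retry values are all hypothetical placeholders, not settings from the original question.

```yaml
resources:
  jobs:
    nightly_etl:                                  # hypothetical job name
      name: "nightly-etl"
      job_clusters:
        - job_cluster_key: "spot_cluster"
          new_cluster:
            spark_version: "15.4.x-scala2.12"     # assumption: any supported runtime works here
            node_type_id: "i3.xlarge"
            num_workers: 4
            aws_attributes:
              availability: "SPOT_WITH_FALLBACK"
              first_on_demand: 1                  # keep the driver on on-demand
      tasks:
        - task_key: "load"
          job_cluster_key: "spot_cluster"
          notebook_task:
            notebook_path: "/Workspace/etl/load"  # hypothetical path
          max_retries: 2                          # rerun the task up to twice after a failure
          min_retry_interval_millis: 60000        # wait one minute between attempts
          retry_on_timeout: false
```

A retry reruns the whole task, so this only helps if the notebook is idempotent or resumes from its own checkpoint.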

Practical takeaway:

If you want the easiest, safest setup, skip pools and use SPOT_WITH_FALLBACK with first_on_demand = 1. If you need faster startup or tighter capacity control, use pools, but remember that mixing spot and on-demand always means multiple pools.

Cheers, Louis.
