01-28-2026 07:42 AM
Hello community,
In the near future, I need to use spot instances to reduce the cost of running a batch processing job.
My question is: how can I manage my code to properly handle and capture a SIGTERM signal?
Is there any documentation or guidance you can point me to for managing applications running on spot instances?
Kind regards,
Jeremy
01-28-2026 08:11 AM
Greetings @jeremy98 , I did some research and found some helpful hints/tips to help you think about your scenario.
Running batch jobs cost-effectively on spot instances really comes down to two things working together:
a resilient Databricks/Spark configuration, and
simple signal handling in your application so it can shut down cleanly when capacity is reclaimed.
What to configure on Databricks for spot instances
First, keep the driver on on-demand and use spot for workers. This is the single most important guardrail. Losing executors is survivable; losing the driver is not. In the UI, that means the first node is on-demand and subsequent nodes can be spot. Also avoid using a spot pool for the driver type.
Second, enable decommissioning on clusters that use spot. When a preemption notice arrives (typically 30 seconds to a couple of minutes depending on cloud), Spark will try to migrate shuffle and RDD data off the executor that's about to go away. This dramatically reduces failures and recomputation, and importantly, failures caused by preemption aren't counted as normal task failures.
If you're seeing frequent spot loss, that's usually a signal to reconsider instance types or the workload itself. Some instance families are reclaimed far more aggressively than others, and certain jobs just aren't good candidates for spot.
For better acquisition and price stability, consider Fleet or flexible node types and pools. This lets Databricks choose from multiple equivalent instance types and improves your odds of getting spot capacity at a reasonable price. You can also cap spot pricing as a percentage of on-demand.
From a planning perspective, Databricks' cost-optimization guidance consistently recommends spot workers with an on-demand driver, with the trade-off being stricter SLAs. On Azure in particular, Databricks will automatically replace evicted spot workers with pay-as-you-go nodes so jobs can continue predictably, even though eviction notices may be short.
Finally, if you're hitting stage failures during shuffles, that's a classic symptom of spot reclamation mid-execution. Decommissioning is the first fix; switching critical paths to on-demand workers is the fallback.
Capturing SIGTERM for graceful shutdown
When a cloud provider reclaims a spot instance, the OS receives a termination signal. Your application should trap SIGTERM, clean up, and exit quickly. Here are minimal patterns you can drop into batch apps.
Python:
import signal, sys, time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True
    # flush logs, close connections, persist checkpoints

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    while not shutting_down:
        # do work in small, idempotent units
        time.sleep(0.5)
    sys.exit(0)

if __name__ == "__main__":
    main()
Java:
public class App {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // flush, close, checkpoint
        }));
        // main job logic
    }
}
Scala:
object App extends App {
  sys.addShutdownHook {
    // flush, close, checkpoint
  }
  // main job logic
}
Bash:
cleanup() { :; }  # replace with real cleanup: flush buffers, persist checkpoints
trap 'cleanup; exit 143' TERM INT
A few Spark-specific tips that really matter here. Make your work idempotent and commit in small units. Writing to Delta in batches or using foreachBatch in Structured Streaming means a retry after preemption won't corrupt outputs.
For streaming jobs, keep a reference to the StreamingQuery and stop it in the signal handler so Spark can close sources and sinks cleanly:
query = df.writeStream.option("checkpointLocation", "...").start("...")

def handle_sigterm(signum, frame):
    query.stop()

signal.signal(signal.SIGTERM, handle_sigterm)
query.awaitTermination()
The sweet spot is combining app-level signal handling with Databricks decommissioning. Spark migrates executor state, and your process exits politely.
Practical checklist
Use Jobs Compute with spot workers and an on-demand driver.
Enable decommissioning and data migration.
Trap SIGTERM and flush/close quickly; keep outputs idempotent.
Right-size instance types and consider Fleet/flexible nodes.
If evictions are constant, temporarily move critical workloads to on-demand.
Hope this helps, Louis.
01-28-2026 08:26 AM - edited 01-28-2026 08:28 AM
Hi @Louis_Frolio,
Thanks for your complete answer! I really appreciate that.
I have two questions:
1. How can I set the driver node to on-demand and the workers to spot instances? When I create a new compute pool, I can't choose which nodes are spot and which are not. Do I maybe need to create two different pools, one for the driver (all on-demand) and one for the workers (all spot instances)?
Can you provide the YAML boilerplate for writing this in a resource file? We've never done it!
2. When is it suggested to use spot instances? When the job execution takes more than xx minutes, or..? From what I've read, I'm thinking our job may not need spot instances at all!
01-28-2026 08:33 AM
Hey @jeremy98 , here are some suggestions, but please test in a safe environment as I cannot guarantee the desired outcome.
You can absolutely run the driver on on-demand and the workers on spot in Databricks. There are two clean ways to do it, depending on whether you want to use instance pools.
Option A is the simplest: no pools. You let the cluster manage placement by setting availability to SPOT_WITH_FALLBACK and first_on_demand = 1. That guarantees the driver comes up on on-demand, while executors prefer spot and gracefully fall back if spot capacity isn't there.
resources:
  clusters:
    etl_cluster_spot_fallback:
      cluster_name: "etl-cluster-spot-fallback"
      spark_version: "auto:latest-lts"
      node_type_id: "i3.xlarge"
      autotermination_minutes: 60
      aws_attributes:
        availability: "SPOT_WITH_FALLBACK"
        first_on_demand: 1
        spot_bid_price_percent: 100
      autoscale:
        min_workers: 2
        max_workers: 8
      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"
What's happening here is straightforward: the driver is pinned to on-demand, workers prefer spot, and Databricks handles fallback automatically if spot capacity dries up. For many teams, this is the best balance of simplicity and resilience.
Option B uses pools, which gives you tighter control and faster startup, but requires a bit more plumbing. Because pools are either all spot or all on-demand, mixing driver and workers means two pools.
resources:
  instance_pools:
    driver_on_demand_pool:
      instance_pool_name: "driver-on-demand-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 10
      aws_attributes:
        availability: "ON_DEMAND"
    workers_spot_pool:
      instance_pool_name: "workers-spot-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 50
      aws_attributes:
        availability: "SPOT"
        spot_bid_price_percent: 100
  clusters:
    etl_cluster_pools_hybrid:
      cluster_name: "etl-cluster-pools-hybrid"
      spark_version: "auto:latest-lts"
      autotermination_minutes: 60
      driver_instance_pool_id: ${resources.instance_pools.driver_on_demand_pool.id}
      instance_pool_id: ${resources.instance_pools.workers_spot_pool.id}
      autoscale:
        min_workers: 2
        max_workers: 8
      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"
A couple of important nuances here. Pools are all-or-nothing for spot vs on-demand, so separate pools are required. Also, on pool-backed clusters, availability behavior is governed by the pool itself. In other words, fallback is much easier to reason about when you're not using pools.
Now, when should you actually use spot?
There's no magic "X minutes" threshold. It really comes down to interruption tolerance.
Spot works well for retryable, checkpointed batch ETL, ML training jobs that can resume, and anything where a restart is annoying but not catastrophic. Pairing spot workers with an on-demand driver and fallback gives you a very solid cost/performance tradeoff.
Be cautious with always-on streaming, tight SLAs, or capacity-constrained regions. In those cases, eviction risk can outweigh the savings. If you must use spot, keep the driver on on-demand and be explicit about your retry and checkpoint strategy.
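If you do run on spot, job-level retries are a cheap safety net on top of checkpointing. As a sketch only (retry field names from the Databricks Jobs API; the job name, task key, and notebook path are illustrative), a bundle job definition with retries might look like:

```yaml
resources:
  jobs:
    my_etl_job:
      name: "my-etl-job"
      tasks:
        - task_key: "etl"
          notebook_task:
            notebook_path: "/Workspace/path/to/etl"  # illustrative path
          max_retries: 2
          min_retry_interval_millis: 60000
          retry_on_timeout: true
```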
Practical takeaway:
If you want the easiest, safest setup, skip pools and use SPOT_WITH_FALLBACK with first_on_demand = 1. If you need faster startup or tighter capacity control, use pools; just remember that mixing spot and on-demand always means multiple pools.
Cheers, Louis.
01-28-2026 08:44 AM
Thanks again @Louis_Frolio, awesome!
Your explanations are clear :), so this means that even if we move to Option A we still need to handle SIGTERM, right?
Kind regards,
Jeremy
01-28-2026 10:18 AM
Short answer: no, you don't need to implement your own SIGTERM handler, even with Option B (separate on-demand driver pool + spot worker pool). Databricks already detects cloud spot-preemption notices and automatically kicks off Spark's decommissioning path. As long as decommissioning is enabled, Spark will gracefully drain work and migrate data for you. There's no need to trap OS signals in your job code.
What to enable on the cluster (this applies whether or not you use pools):
spark.decommission.enabled=true
Turns on executor decommissioning.
spark.storage.decommission.enabled=true
Allows cached data to be moved during decommission.
spark.storage.decommission.shuffleBlocks.enabled=true
Migrates shuffle blocks off the executor being reclaimed.
spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled=true
Refreshes block locations to reduce downstream fetch failures.
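One way to apply the four settings above, if your clusters are defined in a bundle like the earlier YAML, is a spark_conf block (keys exactly as listed above; the cluster key is illustrative, and values are quoted as Spark conf strings):

```yaml
resources:
  clusters:
    etl_cluster_spot_fallback:
      spark_conf:
        spark.decommission.enabled: "true"
        spark.storage.decommission.enabled: "true"
        spark.storage.decommission.shuffleBlocks.enabled: "true"
        spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled: "true"
```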
Optional (if you manage worker environments directly):
SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
Why this is sufficient:
When a spot instance is about to be reclaimed, Databricks receives the cloud interruption notice (typically 30 seconds to a couple of minutes in advance) and issues a decommission request to Spark. Spark then drains tasks and migrates state automatically; no custom SIGTERM handling required.
These settings significantly reduce shuffle fetch failures and recomputation when a spot worker disappears. Job-level retries are still a good safety net for truly abrupt losses, but most preemptions are handled cleanly.
Note on pools:
A spot worker pool is still subject to eviction. Decommissioning is the recommended mitigation. On Azure in particular, evicted spot capacity does not automatically fall back to on-demand, which makes decommissioning plus sensible retries even more important.
01-28-2026 11:43 AM
@Louis_Frolio Thanks for the comprehensive answers!
In my case, we do have a few workflows that are using single-node clusters. Do you have any suggestions in this case?
02-05-2026 06:52 AM
@Kirankumarbs, in the case of a single-node cluster, well... you get what you get :). A single-node cluster can only run on an on-demand instance. That one node hosts both the driver and the executor, so if the node goes away, there's no recovery path.
Hope this helps, Louis.