01-28-2026 07:42 AM
Hello community,
In the near future, I need to use spot instances to reduce the cost of running a batch processing job.
My question is: how can I manage my code to properly handle and capture a SIGTERM signal?
Is there any documentation or guidance you can point me to for managing applications running on spot instances?
Kind regards,
Jeremy
01-28-2026 08:11 AM
Greetings @jeremy98 , I did some research and found some helpful hints/tips to help you think about your scenario.
Running batch jobs cost-effectively on spot instances really comes down to two things working together:
a resilient Databricks/Spark configuration, and
simple signal handling in your application so it can shut down cleanly when capacity is reclaimed.
What to configure on Databricks for spot instances
First, keep the driver on on-demand and use spot for workers. This is the single most important guardrail. Losing executors is survivable; losing the driver is not. In the UI, that means the first node is on-demand and subsequent nodes can be spot. Also avoid using a spot pool for the driver type.
Second, enable decommissioning on clusters that use spot. When a preemption notice arrives (typically 30 seconds to a couple of minutes depending on cloud), Spark will try to migrate shuffle and RDD data off the executor that's about to go away. This dramatically reduces failures and recomputation, and importantly, failures caused by preemption aren't counted as normal task failures.
If you're seeing frequent spot loss, that's usually a signal to reconsider instance types or the workload itself. Some instance families are reclaimed far more aggressively than others, and certain jobs just aren't good candidates for spot.
For better acquisition and price stability, consider Fleet or flexible node types and pools. This lets Databricks choose from multiple equivalent instance types and improves your odds of getting spot capacity at a reasonable price. You can also cap spot pricing as a percentage of on-demand.
From a planning perspective, Databricks' cost-optimization guidance consistently recommends spot workers with an on-demand driver, with the trade-off being stricter SLAs. On Azure in particular, Databricks will automatically replace evicted spot workers with pay-as-you-go nodes so jobs can continue predictably, even though eviction notices may be short.
Finally, if you're hitting stage failures during shuffles, that's a classic symptom of spot reclamation mid-execution. Decommissioning is the first fix; switching critical paths to on-demand workers is the fallback.
Capturing SIGTERM for graceful shutdown
When a cloud provider reclaims a spot instance, the OS receives a termination signal. Your application should trap SIGTERM, clean up, and exit quickly. Here are minimal patterns you can drop into batch apps.
Python:
import signal, sys, time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True
    # flush logs, close connections, persist checkpoints

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    while not shutting_down:
        # do work in small, idempotent units
        time.sleep(0.5)
    sys.exit(0)

if __name__ == "__main__":
    main()
Java:
public class App {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // flush, close, checkpoint
        }));
        // main job logic
    }
}
Scala:
object App extends App {
  sys.addShutdownHook {
    // flush, close, checkpoint
  }
  // main job logic
}
Bash:
cleanup() { :; }  # replace with real cleanup: flush buffers, persist checkpoints
trap 'cleanup; exit 143' TERM INT
A few Spark-specific tips that really matter here. Make your work idempotent and commit in small units. Writing to Delta in batches or using foreachBatch in Structured Streaming means a retry after preemption won't corrupt outputs.
For streaming jobs, keep a reference to the StreamingQuery and stop it in the signal handler so Spark can close sources and sinks cleanly:
query = df.writeStream.option("checkpointLocation", "...").start("...")

def handle_sigterm(signum, frame):
    query.stop()

signal.signal(signal.SIGTERM, handle_sigterm)
query.awaitTermination()
The sweet spot is combining app-level signal handling with Databricks decommissioning. Spark migrates executor state, and your process exits politely.
Practical checklist
Use Jobs Compute with spot workers and an on-demand driver.
Enable decommissioning and data migration.
Trap SIGTERM and flush/close quickly; keep outputs idempotent.
Right-size instance types and consider Fleet/flexible nodes.
If evictions are constant, temporarily move critical workloads to on-demand.
Hope this helps, Louis.
01-28-2026 08:26 AM - edited 01-28-2026 08:28 AM
Hi @Louis_Frolio,
Thanks for your complete answer! I really appreciate that.
I have two questions:
1. How can I set the driver node to on-demand and the workers to spot instances? When I create a new compute pool, I can't choose which nodes are spot and which are not. Do I maybe need to create two different pools, one for the driver (all on-demand) and one for the workers (all spot instances)?
Can you provide the YAML boilerplate for writing this in a resource file? We've never done it!
2. When is it suggested to use spot instances? When the job execution takes more than xx minutes, or..? From what I've read, I'm thinking our job may not need spot instances at all!
01-28-2026 08:33 AM
Hey @jeremy98 , here are some suggestions, but please test in a safe environment as I cannot guarantee the desired outcome.
You can absolutely run the driver on on-demand and the workers on spot in Databricks. There are two clean ways to do it, depending on whether you want to use instance pools.
Option A is the simplest: no pools. You let the cluster manage placement by setting availability to SPOT_WITH_FALLBACK and first_on_demand = 1. That guarantees the driver comes up on on-demand, while executors prefer spot and gracefully fall back if spot capacity isn't there.
resources:
  clusters:
    etl_cluster_spot_fallback:
      cluster_name: "etl-cluster-spot-fallback"
      spark_version: "auto:latest-lts"
      node_type_id: "i3.xlarge"
      autotermination_minutes: 60
      aws_attributes:
        availability: "SPOT_WITH_FALLBACK"
        first_on_demand: 1
        spot_bid_price_percent: 100
      autoscale:
        min_workers: 2
        max_workers: 8
      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"
What's happening here is straightforward: the driver is pinned to on-demand, workers prefer spot, and Databricks handles fallback automatically if spot capacity dries up. For many teams, this is the best balance of simplicity and resilience.
Option B uses pools, which gives you tighter control and faster startup, but requires a bit more plumbing. Because pools are either all spot or all on-demand, mixing driver and workers means two pools.
resources:
  instance_pools:
    driver_on_demand_pool:
      instance_pool_name: "driver-on-demand-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 10
      aws_attributes:
        availability: "ON_DEMAND"
    workers_spot_pool:
      instance_pool_name: "workers-spot-pool"
      node_type_id: "i3.xlarge"
      min_idle_instances: 0
      max_capacity: 50
      aws_attributes:
        availability: "SPOT"
        spot_bid_price_percent: 100
  clusters:
    etl_cluster_pools_hybrid:
      cluster_name: "etl-cluster-pools-hybrid"
      spark_version: "auto:latest-lts"
      autotermination_minutes: 60
      driver_instance_pool_id: ${resources.instance_pools.driver_on_demand_pool.id}
      instance_pool_id: ${resources.instance_pools.workers_spot_pool.id}
      autoscale:
        min_workers: 2
        max_workers: 8
      data_security_mode: "USER_ISOLATION"
      custom_tags:
        Project: "my-etl"
        Environment: "prod"
A couple of important nuances here. Pools are all-or-nothing for spot vs on-demand, so separate pools are required. Also, on pool-backed clusters, availability behavior is governed by the pool itself. In other words, fallback is much easier to reason about when you're not using pools.
Now, when should you actually use spot?
There's no magic "X minutes" threshold. It really comes down to interruption tolerance.
Spot works well for retryable, checkpointed batch ETL, ML training jobs that can resume, and anything where a restart is annoying but not catastrophic. Pairing spot workers with an on-demand driver and fallback gives you a very solid cost/performance tradeoff.
Be cautious with always-on streaming, tight SLAs, or capacity-constrained regions. In those cases, eviction risk can outweigh the savings. If you must use spot, keep the driver on on-demand and be explicit about your retry and checkpoint strategy.
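If you do run on spot, job-level retries are a cheap safety net on top of checkpointing. As a sketch only (retry field names from the Databricks Jobs API; the job name, task key, and notebook path are illustrative), a bundle job definition with retries might look like:

```yaml
resources:
  jobs:
    my_etl_job:
      name: "my-etl-job"
      tasks:
        - task_key: "etl"
          notebook_task:
            notebook_path: "/Workspace/path/to/etl"  # illustrative path
          max_retries: 2
          min_retry_interval_millis: 60000
          retry_on_timeout: true
```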
Practical takeaway:
If you want the easiest, safest setup, skip pools and use SPOT_WITH_FALLBACK with first_on_demand = 1. If you need faster startup or tighter capacity control, use pools; just remember that mixing spot and on-demand always means multiple pools.
Cheers, Louis.
01-28-2026 08:44 AM
Thanks again @Louis_Frolio, awesome!
Your explanations are clear :), so this means that even if we move to Option A we still need to handle SIGTERM, right?
Kind regards,
Jeremy
01-28-2026 10:18 AM
Short answer: no, you don't need to implement your own SIGTERM handler, even with Option B (separate on-demand driver pool + spot worker pool). Databricks already detects cloud spot-preemption notices and automatically kicks off Spark's decommissioning path. As long as decommissioning is enabled, Spark will gracefully drain work and migrate data for you. There's no need to trap OS signals in your job code.
What to enable on the cluster (this applies whether or not you use pools):
spark.decommission.enabled=true
Turns on executor decommissioning.
spark.storage.decommission.enabled=true
Allows cached data to be moved during decommission.
spark.storage.decommission.shuffleBlocks.enabled=true
Migrates shuffle blocks off the executor being reclaimed.
spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled=true
Refreshes block locations to reduce downstream fetch failures.
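One way to apply the four settings above, if your clusters are defined in a bundle like the earlier YAML, is a spark_conf block (keys exactly as listed above; the cluster key is illustrative, and values are quoted as Spark conf strings):

```yaml
resources:
  clusters:
    etl_cluster_spot_fallback:
      spark_conf:
        spark.decommission.enabled: "true"
        spark.storage.decommission.enabled: "true"
        spark.storage.decommission.shuffleBlocks.enabled: "true"
        spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled: "true"
```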
Optional (if you manage worker environments directly):
SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
Why this is sufficient:
When a spot instance is about to be reclaimed, Databricks receives the cloud interruption notice (typically 30 seconds to a couple of minutes in advance) and issues a decommission request to Spark. Spark then drains tasks and migrates state automatically; no custom SIGTERM handling required.
These settings significantly reduce shuffle fetch failures and recomputation when a spot worker disappears. Job-level retries are still a good safety net for truly abrupt losses, but most preemptions are handled cleanly.
Note on pools:
A spot worker pool is still subject to eviction. Decommissioning is the recommended mitigation. On Azure in particular, evicted spot capacity does not automatically fall back to on-demand, which makes decommissioning plus sensible retries even more important.
01-28-2026 11:43 AM
@Louis_Frolio Thanks for the comprehensive answers!
In my case, we do have a few workflows that are using single-node clusters. Do you have any suggestions in this case?
02-05-2026 06:52 AM
@Kirankumarbs, in the case of a single-node cluster, well... you get what you get :). A single-node cluster can only run on an on-demand instance. That one node hosts both the driver and the executor, so if the node goes away, there's no recovery path.
Hope this helps, Louis.