01-28-2026 08:11 AM
Greetings @jeremy98, I did some research and found a few tips that should help you think through your scenario.
Running batch jobs cost-effectively on spot instances really comes down to two things working together:
- a resilient Databricks/Spark configuration, and
- simple signal handling in your application so it can shut down cleanly when capacity is reclaimed.
What to configure on Databricks for spot instances
First, keep the driver on on-demand and use spot for workers. This is the single most important guardrail. Losing executors is survivable; losing the driver is not. In the UI, that means the first node is on-demand and subsequent nodes can be spot. Also avoid using a spot pool for the driver type.
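If you create clusters through the API rather than the UI, the same guardrail can be expressed in the cluster spec. This is a minimal sketch assuming AWS, where `first_on_demand` and `availability` are fields of `aws_attributes`; the helper name `spot_cluster_spec` and the argument values are hypothetical, and Azure and GCP use different attribute blocks.

```python
def spot_cluster_spec(num_workers: int, node_type_id: str, spark_version: str) -> dict:
    """Build a cluster spec (AWS field names) with an on-demand driver and spot workers.

    Hypothetical helper: it only assembles a dict, it does not call any API.
    """
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "aws_attributes": {
            # The first N nodes are on-demand; the driver is placed first.
            "first_on_demand": 1,
            # Remaining nodes use spot and fall back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
        },
    }
```

The returned dict would go into the `new_cluster` section of a job definition; check the field names against the Clusters API reference for your cloud before relying on it.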
Second, enable decommissioning on clusters that use spot. When a preemption notice arrives (typically 30 seconds to a couple of minutes depending on cloud), Spark will try to migrate shuffle and RDD data off the executor that’s about to go away. This dramatically reduces failures and recomputation, and importantly, failures caused by preemption aren’t counted as normal task failures.
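As a rough illustration, decommissioning is controlled through Spark configuration. The keys below are the open-source Spark (3.1+) settings; whether your Databricks runtime needs all of them, or already sets some, is an assumption to verify against the docs for your version. The helper `with_decommissioning` is hypothetical.

```python
# Open-source Spark decommissioning settings (verify against your runtime's docs).
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}


def with_decommissioning(cluster_spec: dict) -> dict:
    """Return a copy of cluster_spec with the decommissioning keys merged into spark_conf."""
    merged = dict(cluster_spec)
    spark_conf = dict(cluster_spec.get("spark_conf", {}))
    spark_conf.update(DECOMMISSION_CONF)
    merged["spark_conf"] = spark_conf
    return merged
```

The same key/value pairs can be pasted into the cluster's Spark config box in the UI, one `key value` pair per line.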
If you’re seeing frequent spot loss, that’s usually a signal to reconsider instance types or the workload itself. Some instance families are reclaimed far more aggressively than others, and certain jobs just aren’t good candidates for spot.
For better acquisition and price stability, consider Fleet or flexible node types and pools. This lets Databricks choose from multiple equivalent instance types and improves your odds of getting spot capacity at a reasonable price. You can also cap spot pricing as a percentage of on-demand.
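To make the price cap concrete: on AWS it is set with `spot_bid_price_percent` inside `aws_attributes`. The sketch below assumes AWS and uses a hypothetical helper name, `cap_spot_price`; Azure expresses the cap differently, so treat this as illustrative only.

```python
import copy


def cap_spot_price(cluster_spec: dict, percent_of_on_demand: int) -> dict:
    """Return a copy of cluster_spec with the spot bid capped (AWS field name)."""
    if percent_of_on_demand <= 0:
        raise ValueError("percent_of_on_demand must be positive")
    spec = copy.deepcopy(cluster_spec)  # leave the caller's dict untouched
    spec.setdefault("aws_attributes", {})["spot_bid_price_percent"] = percent_of_on_demand
    return spec
```

For example, `cap_spot_price(spec, 80)` asks for spot capacity only while it costs at most 80% of the on-demand price. A lower cap saves more but makes capacity harder to get.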
From a planning perspective, Databricks’ cost-optimization guidance consistently recommends spot workers with an on-demand driver, with the trade-off that spot is a weaker fit for jobs with strict SLAs. On Azure in particular, Databricks will automatically replace evicted spot workers with pay-as-you-go nodes so jobs can continue predictably, even though eviction notices may be short.
Finally, if you’re hitting stage failures during shuffles, that’s a classic symptom of spot reclamation mid-execution. Decommissioning is the first fix; switching critical paths to on-demand workers is the fallback.
Capturing SIGTERM for graceful shutdown
When a cloud provider reclaims a spot instance, processes on that node are typically sent a termination signal shortly before the machine goes away. Your application should trap SIGTERM, clean up, and exit quickly. Here are minimal patterns you can drop into batch apps.
Python:
    import signal, sys, time

    shutting_down = False

    def handle_sigterm(signum, frame):
        global shutting_down
        shutting_down = True
        # flush logs, close connections, persist checkpoints

    signal.signal(signal.SIGTERM, handle_sigterm)

    def main():
        while not shutting_down:
            # do work in small, idempotent units
            time.sleep(0.5)
        sys.exit(0)

    if __name__ == "__main__":
        main()
Java:
    public class App {
        public static void main(String[] args) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                // flush, close, checkpoint
            }));
            // main job logic
        }
    }
Scala:
    object App extends App {
      sys.addShutdownHook {
        // flush, close, checkpoint
      }
      // main job logic
    }
Bash:
    cleanup() { :; }  # placeholder: flush, close, checkpoint
    trap 'cleanup; exit 143' TERM INT
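One variation on the Python pattern above, offered as a sketch rather than the only way to do it: keep the signal handler trivial (it only sets a flag) and do the actual cleanup in the main flow, since a handler can fire in the middle of almost any operation. The names `stop_requested`, `run`, `process`, and `cleanup` are hypothetical.

```python
import os
import signal
import threading

# Set by the handler, polled by the work loop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler tiny: record the request and return.
    stop_requested.set()


def run(units, process, cleanup):
    """Process units one at a time, stopping before the next unit once SIGTERM has arrived."""
    done = []
    try:
        for unit in units:
            if stop_requested.is_set():
                break
            process(unit)
            done.append(unit)
    finally:
        # Flush logs, close connections, persist checkpoints.
        cleanup()
    return done


signal.signal(signal.SIGTERM, handle_sigterm)
```

Because the flag is only checked between units, each unit should be short relative to the preemption notice, otherwise the process may be killed before it reaches the next check.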
A few Spark-specific tips really matter here. Make your work idempotent and commit in small units. Writing to Delta in batches or using foreachBatch in Structured Streaming means a retry after preemption won’t corrupt outputs.
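The "small, idempotent units" idea can be sketched in plain Python, independent of Delta or foreachBatch. This is an illustration under assumptions, not a Databricks API: the helpers `load_done`, `save_done`, and `run_units` are hypothetical, and the checkpoint is a JSON file on a local filesystem where `os.replace` is atomic.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of unit IDs already committed (empty if there is no checkpoint yet)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def save_done(checkpoint_path, done):
    """Write the checkpoint atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, checkpoint_path)


def run_units(unit_ids, process, checkpoint_path):
    """Process each unit at most once per successful commit; safe to rerun after a crash."""
    done = load_done(checkpoint_path)
    for unit_id in unit_ids:
        if unit_id in done:
            continue  # committed by an earlier run
        process(unit_id)
        done.add(unit_id)
        save_done(checkpoint_path, done)
    return done
```

A unit can still run twice if the process dies after `process` finishes but before the checkpoint is saved, which is why `process` itself should be idempotent (for example, overwrite a partition instead of appending to it).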
For streaming jobs, keep a reference to the StreamingQuery and stop it in the signal handler so Spark can close sources and sinks cleanly:
    import signal

    query = df.writeStream.option("checkpointLocation", "...").start("...")

    def handle_sigterm(signum, frame):
        # stop the query so sources and sinks close cleanly
        query.stop()

    signal.signal(signal.SIGTERM, handle_sigterm)
    query.awaitTermination()
The sweet spot is combining app-level signal handling with Databricks decommissioning. Spark migrates executor state, and your process exits politely.
Practical checklist
- Use Jobs Compute with spot workers and an on-demand driver.
- Enable decommissioning and data migration.
- Trap SIGTERM and flush/close quickly; keep outputs idempotent.
- Right-size instance types and consider Fleet/flexible nodes.
- If evictions are constant, temporarily move critical workloads to on-demand.
Hope this helps, Louis.