Louis_Frolio
Databricks Employee

Short answer: No — you don’t need to implement your own SIGTERM handler, even with option 2 (separate on-demand driver pool + spot worker pool). Databricks already detects cloud spot-preemption notices and automatically kicks off Spark’s decommissioning path. As long as decommissioning is enabled, Spark will gracefully drain work and migrate data for you. There’s no need to trap OS signals in your job code.

What to enable on the cluster (this applies whether or not you use pools):

  • spark.decommission.enabled=true

    Turns on executor decommissioning.

  • spark.storage.decommission.enabled=true

    Allows cached data to be moved during decommission.

  • spark.storage.decommission.shuffleBlocks.enabled=true

    Migrates shuffle blocks off the executor being reclaimed.

  • spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled=true

    Refreshes block locations to reduce downstream fetch failures.

  • Optional (if you manage worker environments directly):

    SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
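If you define clusters as JSON (for example through the Clusters API or a job cluster definition), the settings above could be merged into a cluster spec roughly as follows. This is only a sketch: the helper name with_decommissioning and the pool IDs are placeholders I made up for illustration, and it assumes the spec uses the spark_conf, spark_env_vars, driver_instance_pool_id and instance_pool_id fields.

```python
import json

# Spark settings from the list above, as key/value strings.
DECOMMISSION_SPARK_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled": "true",
}

# Optional worker environment variable from the list above.
DECOMMISSION_ENV_VARS = {
    "SPARK_WORKER_OPTS": "-Dspark.decommission.enabled=true",
}


def with_decommissioning(cluster_spec):
    """Return a copy of cluster_spec with the decommission settings merged in.

    Hypothetical helper: existing spark_conf / spark_env_vars entries are kept,
    and the decommission keys overwrite any conflicting values.
    """
    merged = dict(cluster_spec)
    merged["spark_conf"] = {
        **cluster_spec.get("spark_conf", {}),
        **DECOMMISSION_SPARK_CONF,
    }
    merged["spark_env_vars"] = {
        **cluster_spec.get("spark_env_vars", {}),
        **DECOMMISSION_ENV_VARS,
    }
    return merged


# Placeholder pool IDs for "option 2": on-demand driver pool + spot worker pool.
base_spec = {
    "driver_instance_pool_id": "<on-demand-driver-pool-id>",
    "instance_pool_id": "<spot-worker-pool-id>",
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
}

cluster_spec = with_decommissioning(base_spec)
print(json.dumps(cluster_spec, indent=2))
```

The same key/value pairs can also be pasted into the cluster's Spark config and environment variable fields in the UI; the helper is only a convenience when specs are generated in code.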

Why this is sufficient:

  • When a spot instance is about to be reclaimed, Databricks receives the cloud interruption notice (typically 30 seconds to a couple of minutes in advance) and issues a decommission request to Spark. Spark then drains tasks and migrates state automatically — no custom SIGTERM handling required.

  • These settings significantly reduce shuffle fetch failures and recomputation when a spot worker disappears. Job-level retries are still a good safety net for truly abrupt losses, but most preemptions are handled cleanly.
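For the job-level retries mentioned above, a task definition could carry retry settings along these lines. Treat this as a sketch under assumptions: the helper with_retries, the task key, the notebook path, and the retry values are illustrative, and it assumes the task settings accept max_retries, min_retry_interval_millis and retry_on_timeout as in the Jobs API.

```python
def with_retries(task, max_retries=2, min_retry_interval_ms=60_000):
    """Return a copy of a job task definition with retry settings added.

    Hypothetical helper; the default values are illustrative, not recommendations.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    updated = dict(task)
    updated["max_retries"] = max_retries
    # Wait between attempts so replacement workers have time to come up.
    updated["min_retry_interval_millis"] = min_retry_interval_ms
    # Only retry on failures, not on task timeouts.
    updated["retry_on_timeout"] = False
    return updated


# Placeholder task; the key and notebook path are made up for illustration.
task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
}

task_with_retries = with_retries(task)
print(task_with_retries)
```

Decommissioning handles the preemptions that arrive with notice; the retries only cover the cases where a worker disappears before Spark can drain it.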

Note on pools:

A spot worker pool is still subject to eviction. Decommissioning is the recommended mitigation. On Azure in particular, evicted spot capacity does not automatically fall back to on-demand, which makes decommissioning plus sensible retries even more important.