Frequent spot loss of driver nodes resulting in failed jobs when using spot fleet pools

User16783853906
Contributor III

When jobs are scheduled on spot fleet pools, both the driver and the worker nodes are provisioned from the spot pool. We are seeing jobs fail with the exception below whenever the driver's spot instance is lost. What are the best practices for using fleet pools with a 100% spot configuration?

Run result unavailable: job failed with error message
 Cluster {cluster_id} was terminated during the run (cluster state message: Driver node shut down by cloud provider. instance_id: {instance_id}, aws_instance_state_reason: Client.UserInitiatedShutdown, aws_error_message: Client.UserInitiatedShutdown: User initiated s...)

1 ACCEPTED SOLUTION

User16783853906
Contributor III

In this scenario, AWS reclaims the spot instance that hosts the driver node. Databricks has started a preview of the hybrid pools feature, which lets you provision the driver node from a different pool than the worker nodes. To improve reliability when spot loss is frequent, we recommend using an on-demand pool for the driver node, while worker nodes can still be provisioned from your spot fleet pool. As of today, this functionality is supported only through the API.

Steps to configure a cluster with hybrid pools:

1. Create an on-demand pool (or use an existing one) from which the driver node can be provisioned.

2. When creating the cluster, pass the on-demand pool ID as "driver_instance_pool_id" in the cluster creation request.

Example API request to create a hybrid pool cluster:

{
  "num_workers": 1,
  "cluster_name": "test-hybrid-create",
  "spark_version": "7.2.x-scala2.12",
  "spark_conf": {},
  "aws_attributes": {},
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "init_scripts": [],
  "instance_pool_id": "1109-172550-mimic2-pool-worker",
  "driver_instance_pool_id": "1109-172516-retch1-pool-driver"
}
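For anyone who wants to script this, here is a minimal Python sketch of the same request. It assumes the cluster is created through the Clusters API endpoint `/api/2.0/clusters/create` with a personal access token sent as a bearer token. The helper names `build_hybrid_cluster_payload` and `create_cluster`, and the host and token values in the usage comment, are hypothetical and only for illustration. The pool IDs are the ones from the example above.

```python
import json
import urllib.request


def build_hybrid_cluster_payload(worker_pool_id, driver_pool_id,
                                 cluster_name="test-hybrid-create",
                                 spark_version="7.2.x-scala2.12",
                                 num_workers=1,
                                 autotermination_minutes=120):
    """Build a cluster-creation body whose driver and workers use different pools."""
    if worker_pool_id == driver_pool_id:
        # With identical pools the driver would still come from the spot pool.
        raise ValueError("driver and worker pools must differ for a hybrid setup")
    return {
        "num_workers": num_workers,
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "autotermination_minutes": autotermination_minutes,
        "instance_pool_id": worker_pool_id,         # spot fleet pool (workers)
        "driver_instance_pool_id": driver_pool_id,  # on-demand pool (driver)
    }


def create_cluster(host, token, payload):
    """POST the payload to the workspace and return the parsed JSON response.

    Assumption: `host` is the workspace URL and `token` is a personal access token.
    """
    request = urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (placeholder host and token, not executed here):
# payload = build_hybrid_cluster_payload(
#     worker_pool_id="1109-172550-mimic2-pool-worker",
#     driver_pool_id="1109-172516-retch1-pool-driver",
# )
# print(create_cluster("https://<workspace-url>", "<token>", payload))
```

The payload is built separately from the HTTP call so it can be inspected or logged before anything is sent to the workspace.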

Assumptions & Limitations:

* Creating hybrid pool clusters is currently supported only via the API.

* We recommend testing this functionality to confirm that it helps in your case before using it in production.

3 REPLIES

User16826994223
Honored Contributor III

Is this happening in Databricks?

Azure Databricks automatically handles the termination of Spot VMs by starting new pay-as-you-go worker nodes to guarantee your jobs will eventually complete. This provides predictability while helping to lower costs.

In this scenario it is the driver node that is lost, and hybrid pools helped by allocating the driver from an on-demand pool.

