Frequent spot loss of driver nodes resulting in failed jobs when using spot fleet pools

User16783853906
Contributor III

When we use spot fleet pools to schedule jobs, both driver and worker nodes are provisioned from the spot pools. We are seeing jobs fail with the exception below whenever the driver node is lost to a spot reclaim. What are the best practices for using fleet pools with a 100% spot configuration?

Run result unavailable: job failed with error message
 Cluster {cluster_id} was terminated during the run (cluster state message: Driver node shut down by cloud provider. instance_id: {instance_id}, aws_instance_state_reason: Client.UserInitiatedShutdown, aws_error_message: Client.UserInitiatedShutdown: User initiated s...)

Accepted Solutions

User16783853906
Contributor III

In this scenario, AWS reclaimed the driver node. Databricks has started a preview of the hybrid pools feature, which lets you provision the driver node from a different pool than the workers. We recommend using an on-demand pool for the driver node to improve reliability when spot loss is frequent, while worker nodes can still be provisioned from your spot fleet pool. As of today, this functionality is supported only through the API.

Steps to configure a cluster from hybrid pools:

1. Create an on-demand pool (or use an existing one) from which the driver node can be provisioned.

2. When creating the cluster, provide the on-demand pool ID as "driver_instance_pool_id" in the cluster creation request.
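
For step 1, a minimal sketch of a request body for creating an on-demand pool could look like the following. It assumes the body is sent to the Instance Pools API create endpoint and that the pool's "aws_attributes.availability" field is set to "ON_DEMAND"; the helper name, pool name, and node type are hypothetical placeholders, so check the field names against the API reference for your workspace before relying on them.

```python
import json


def build_on_demand_pool_request(pool_name, node_type_id, min_idle_instances=0):
    """Build a request body for an on-demand instance pool (illustrative helper)."""
    return {
        "instance_pool_name": pool_name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle_instances,
        # Assumption: ON_DEMAND keeps the pool's instances off the spot market.
        "aws_attributes": {"availability": "ON_DEMAND"},
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own pool name and node type.
    body = build_on_demand_pool_request("driver-on-demand-pool", "i3.xlarge")
    print(json.dumps(body, indent=2))
```

The pool ID returned when the pool is created is the value to pass as "driver_instance_pool_id" in step 2.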

Example API request to create a hybrid pool cluster:

{
  "num_workers": 1,
  "cluster_name": "test-hybrid-create",
  "spark_version": "7.2.x-scala2.12",
  "spark_conf": {},
  "aws_attributes": {},
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "init_scripts": [],
  "instance_pool_id": "1109-172550-mimic2-pool-worker",
  "driver_instance_pool_id": "1109-172516-retch1-pool-driver"
}
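
As a rough illustration of how a request like this could be submitted, here is a minimal Python sketch that builds the body and posts it to the Clusters API. It assumes the workspace exposes the "/api/2.0/clusters/create" endpoint and accepts a personal access token as a bearer token; the function names, host, and token are hypothetical placeholders, and only the two pool ID fields come from the example above.

```python
import json
import urllib.request


def build_hybrid_cluster_request(worker_pool_id, driver_pool_id,
                                 cluster_name, spark_version, num_workers=1):
    """Build a cluster creation body that uses separate worker and driver pools."""
    return {
        "num_workers": num_workers,
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "autotermination_minutes": 120,
        # Workers come from the spot fleet pool.
        "instance_pool_id": worker_pool_id,
        # The driver comes from the on-demand pool.
        "driver_instance_pool_id": driver_pool_id,
    }


def create_cluster(host, token, body):
    """POST the body to the Clusters API (assumed endpoint and bearer-token auth)."""
    request = urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_hybrid_cluster_request(
        worker_pool_id="1109-172550-mimic2-pool-worker",
        driver_pool_id="1109-172516-retch1-pool-driver",
        cluster_name="test-hybrid-create",
        spark_version="7.2.x-scala2.12",
    )
    # Print the payload only; call create_cluster(host, token, body) with
    # your own workspace URL and token to actually submit it.
    print(json.dumps(body, indent=2))
```

Building the body separately from the HTTP call makes it easy to inspect the payload before any cluster is created.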

Assumptions and limitations:

* Creating hybrid pool clusters is currently supported only via the API.

* We recommend testing this functionality to see whether it helps in your case before using it in production.

3 REPLIES

User16826994223
Honored Contributor III

Is this happening in Databricks?

Azure Databricks automatically handles the termination of Spot VMs by starting new pay-as-you-go worker nodes to guarantee your jobs will eventually complete. This provides predictability while helping to lower costs.

In this scenario it is the driver node that is lost, and hybrid pools helped by allocating the driver from an on-demand pool.
