06-07-2021 12:05 PM
When using spot fleet pools to schedule jobs, both driver and worker nodes are provisioned from the spot pools, and we are seeing jobs fail with the exception below whenever the driver's spot instance is lost. Please share best practices for using fleet pools with a 100% spot configuration.
Run result unavailable: job failed with error message
Cluster {cluster_id} was terminated during the run (cluster state message: Driver node shut down by cloud provider. instance_id: {instance_id}, aws_instance_state_reason: Client.UserInitiatedShutdown, aws_error_message: Client.UserInitiatedShutdown: User initiated s...)
06-23-2021 02:20 PM
In this scenario, AWS reclaims the driver node. Databricks has started a preview of the hybrid pools feature, which lets you provision the driver node from a different pool than the worker nodes. We recommend using an on-demand pool for the driver node to improve reliability when spot loss is frequent, while worker nodes can still be provisioned from your spot fleet pool. As of today, this functionality is supported only through the API.
Steps to configure a cluster from hybrid pools:
1. Create an on-demand pool (or use an existing one) from which the driver node can be provisioned.
2. When creating the cluster, provide the on-demand pool ID as "driver_instance_pool_id" in the cluster creation request.
Example API request to create a hybrid pool cluster:
{
  "num_workers": 1,
  "cluster_name": "test-hybrid-create",
  "spark_version": "7.2.x-scala2.12",
  "spark_conf": {},
  "aws_attributes": {},
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "init_scripts": [],
  "instance_pool_id": "1109-172550-mimic2-pool-worker",
  "driver_instance_pool_id": "1109-172516-retch1-pool-driver"
}
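For anyone scripting this, here is a minimal Python sketch of how the request above might be assembled and sent. It assumes the Clusters API 2.0 endpoint `/api/2.0/clusters/create` and a personal access token passed as a bearer token. The function names, the workspace host, and the token are hypothetical placeholders, not part of the original answer, and the pool IDs are simply the ones from the example request.

```python
import json
import urllib.request


def build_hybrid_cluster_payload(worker_pool_id, driver_pool_id,
                                 cluster_name="test-hybrid-create",
                                 spark_version="7.2.x-scala2.12",
                                 num_workers=1,
                                 autotermination_minutes=120):
    """Build a cluster-create body whose driver and workers use different pools.

    Hypothetical helper: it only assembles the JSON fields shown in the
    example request above.
    """
    if not worker_pool_id or not driver_pool_id:
        raise ValueError("both a worker pool ID and a driver pool ID are required")
    return {
        "num_workers": num_workers,
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "autotermination_minutes": autotermination_minutes,
        # Workers come from the spot fleet pool.
        "instance_pool_id": worker_pool_id,
        # The driver comes from the on-demand pool.
        "driver_instance_pool_id": driver_pool_id,
    }


def build_create_request(host, token, payload):
    """Prepare (but do not send) the POST request for the create call."""
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_hybrid_cluster(host, token, payload):
    """Send the request and return the parsed JSON response."""
    request = build_create_request(host, token, payload)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To use it, build the payload with your spot fleet pool ID as the worker pool and your on-demand pool ID as the driver pool, then call `create_hybrid_cluster` with your workspace URL and token. Keeping payload construction separate from the HTTP call makes it easy to inspect the request before sending it to a real workspace.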
Assumptions & Limitations:
* Creating hybrid pool clusters is currently supported only via the API.
* We recommend testing this functionality to see whether it helps in your case before using it in production.
06-18-2021 06:08 AM
Is this happening in Databricks?
Azure Databricks automatically handles the termination of Spot VMs by starting new pay-as-you-go worker nodes to guarantee your jobs will eventually complete. This provides predictability while helping to lower costs.
06-25-2021 12:08 PM
The driver node is lost in this scenario, and hybrid pools helped by allocating the driver from an on-demand pool.