Databricks

User16783853906 · ‎06-07-2021

When we schedule jobs on spot fleet pools, both the driver and the worker nodes are provisioned from the spot pools, and jobs fail with the exception below whenever the driver's spot instance is lost. Please share best practices for using fleet pools with a 100% spot configuration.

Run result unavailable: job failed with error message
 Cluster {cluster_id} was terminated during the run (cluster state message: Driver node shut down by cloud provider. instance_id: {instance_id}, aws_instance_state_reason: Client.UserInitiatedShutdown, aws_error_message: Client.UserInitiatedShutdown: User initiated s...)

User16783853906 · ‎06-23-2021

In this scenario, the driver node is reclaimed by AWS. Databricks has started a preview of the hybrid pools feature, which allows you to provision the driver node from a different pool than the workers. We recommend using an on-demand pool for the driver node to improve reliability when spot loss is frequent; worker nodes can still be provisioned from your spot fleet pool. As of today, this functionality is supported only through the API.

Steps to configure a cluster from hybrid pools:

1. Create (or reuse an existing) on-demand pool from which the driver node can be provisioned.

2. When creating the cluster, provide the on-demand pool ID as "driver_instance_pool_id" in the cluster creation request.
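As a rough sketch of step 1, an on-demand driver pool could be described with a payload like the one below and posted to the Instance Pools API (`/api/2.0/instance-pools/create`). The helper name, pool name, node type, and idle settings are placeholder assumptions for illustration, not values from this thread.

```python
import json


def on_demand_pool_spec(name, node_type_id, min_idle=0, idle_autoterm_min=30):
    """Build an instance pool payload that requests on-demand capacity only.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        "idle_instance_autotermination_minutes": idle_autoterm_min,
        # ON_DEMAND keeps the driver off spot capacity.
        "aws_attributes": {"availability": "ON_DEMAND"},
    }


if __name__ == "__main__":
    # Placeholder name and node type; substitute your own.
    print(json.dumps(on_demand_pool_spec("driver-on-demand-pool", "i3.xlarge"), indent=2))
```

The pool ID returned when the pool is created is the value to pass as "driver_instance_pool_id" in step 2.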

Example API request to create a hybrid pool cluster:

{
  "num_workers": 1,
  "cluster_name": "test-hybrid-create",
  "spark_version": "7.2.x-scala2.12",
  "spark_conf": {},
  "aws_attributes": {},
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 120,
  "init_scripts": [],
  "instance_pool_id": "1109-172550-mimic2-pool-worker",
  "driver_instance_pool_id": "1109-172516-retch1-pool-driver"
}
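For anyone scripting this, here is a minimal sketch of sending such a request to the Clusters API (`/api/2.0/clusters/create`) from Python using only the standard library. The function names, the workspace URL, and the token are assumptions for illustration; the pool IDs are the ones from the example request.

```python
import json
import urllib.request


def hybrid_cluster_spec(name, spark_version, worker_pool_id, driver_pool_id, num_workers=1):
    """Build a cluster spec whose driver and workers come from different pools."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "num_workers": num_workers,
        "autotermination_minutes": 120,
        "instance_pool_id": worker_pool_id,         # spot fleet pool for workers
        "driver_instance_pool_id": driver_pool_id,  # on-demand pool for the driver
    }


def create_cluster(host, token, spec):
    """POST the spec to the workspace and return the parsed JSON response.

    `host` and `token` are assumed to be a workspace URL and a personal
    access token; neither value appears in the original thread.
    """
    req = urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


spec = hybrid_cluster_spec(
    "test-hybrid-create",
    "7.2.x-scala2.12",
    worker_pool_id="1109-172550-mimic2-pool-worker",
    driver_pool_id="1109-172516-retch1-pool-driver",
)
# Example call (placeholder host and token, so it is left commented out):
# create_cluster("https://<workspace-url>", "<personal-access-token>", spec)
```

Building the spec separately from the HTTP call makes it easy to inspect or log the payload before anything is submitted.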

Assumptions and limitations:

* Creating hybrid pool clusters is currently supported only via the API.

* We recommend testing this functionality to confirm that it helps in your case before using it in production.

User16826994223 · ‎06-18-2021

Is this happening in Databricks?

Azure Databricks automatically handles the termination of Spot VMs by starting new pay-as-you-go worker nodes to guarantee your jobs will eventually complete. This provides predictability while helping to lower costs.

User16783853906 · ‎06-25-2021

The driver node was lost in this scenario, and hybrid pools helped by allocating the driver from an on-demand pool.
