Data Engineering

Cluster pools

ksenija
Contributor

Could you help me understand pools?

How can I work out the difference in pricing between running clusters with and without a pool? With a pool, we save the time it takes to start and stop the cluster.

And should we keep Min Idle above 0 or equal to 0?

Also, what's your best practice: do you use pools all the time or only for some reporting/urgent purposes?

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

Databricks pools are a set of idle, ready-to-use instances. When a cluster is attached to a pool, cluster nodes are created using the pool's idle instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider to accommodate the cluster's request. When a cluster releases an instance, the instance returns to the pool and is free for another cluster to use. Databricks does not charge DBUs while instances are idle in the pool, which saves cost. However, cloud provider infrastructure costs still apply to those idle instances.
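To make the pricing question concrete, here is a rough sketch of how the two setups could be compared. It only encodes the rule above (idle pool instances cost cloud infrastructure but no DBUs). All rates and names (`Rates`, `cost_without_pool`, `cost_with_pool`) are made-up placeholders, not real prices, and the model ignores cluster start-up time, so treat it as a starting point rather than a billing calculator.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    # All three values are hypothetical inputs; look up your own prices.
    vm_per_hour: float   # cloud provider price per instance-hour
    dbu_per_hour: float  # DBUs one instance consumes per hour while in a cluster
    dbu_price: float     # price per DBU


def cost_without_pool(rates: Rates, nodes: int, busy_hours: float) -> float:
    """Instances exist only while the cluster runs: VM cost plus DBU cost."""
    per_node_hour = rates.vm_per_hour + rates.dbu_per_hour * rates.dbu_price
    return nodes * busy_hours * per_node_hour


def cost_with_pool(rates: Rates, nodes: int, busy_hours: float,
                   idle_instance_hours: float) -> float:
    """Same busy cost, plus VM-only cost for time instances sit idle in the pool."""
    busy = cost_without_pool(rates, nodes, busy_hours)
    idle = idle_instance_hours * rates.vm_per_hour  # no DBUs while idle
    return busy + idle


if __name__ == "__main__":
    rates = Rates(vm_per_hour=0.5, dbu_per_hour=2.0, dbu_price=0.25)
    # 4 nodes busy for 3 hours; with the pool, 2 instances also sit idle for 5 hours.
    print(cost_without_pool(rates, nodes=4, busy_hours=3))
    print(cost_with_pool(rates, nodes=4, busy_hours=3, idle_instance_hours=2 * 5))
```

The difference between the two numbers is the price paid for faster start-up, which is why the idle instance-hours are the figure to watch.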

For the Min Idle setting, the recommendation is to set Min Idle instances to 0 so that you don't pay for running instances that aren't doing work. The trade-off is that a cluster may take longer to start when it has to acquire a new instance. If you only run interactive workloads during business hours, make sure the pool's "Min Idle" instance count is set to zero after hours. If your automated data pipeline runs for a few hours at night, raise the "Min Idle" count a few minutes before the pipeline starts and revert it to zero afterwards.
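The "raise it shortly before the pipeline, then revert to zero" schedule could be sketched like this. The helper names, the 10-minute warm-up, and the example window are assumptions for illustration. The payload fields follow my understanding of the Instance Pools edit endpoint (`POST /api/2.0/instance-pools/edit`); check the API reference for which fields are required in your workspace and whether omitted fields are reset before sending anything.

```python
from datetime import datetime, time, timedelta


def desired_min_idle(now: datetime, start: time, end: time,
                     warm_count: int, warmup_minutes: int = 10) -> int:
    """Return warm_count from warmup_minutes before `start` until `end`, else 0.

    Assumes the pipeline window does not cross midnight.
    """
    window_open = datetime.combine(now.date(), start) - timedelta(minutes=warmup_minutes)
    window_close = datetime.combine(now.date(), end)
    return warm_count if window_open <= now < window_close else 0


def pool_edit_payload(pool_id: str, pool_name: str, node_type_id: str,
                      min_idle: int) -> dict:
    """Build the JSON body for an instance pool edit request (not sent here)."""
    return {
        "instance_pool_id": pool_id,
        "instance_pool_name": pool_name,
        "node_type_id": node_type_id,  # must match the pool's existing node type
        "min_idle_instances": min_idle,
    }


if __name__ == "__main__":
    # Hypothetical pipeline window 01:00-04:00 that wants 3 warm instances.
    now = datetime(2024, 1, 1, 0, 55)
    count = desired_min_idle(now, time(1, 0), time(4, 0), warm_count=3)
    print(pool_edit_payload("<pool-id>", "nightly-etl", "<node-type>", count))
```

A scheduler (a cron job or a small Databricks job) could call this every few minutes and send the payload only when the desired count changes.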

As for the best practice of using pools, it depends on your specific use case. If your driver node and worker nodes have different requirements, create a different pool for each. You can minimize instance acquisition time by creating a pool for each instance type and Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create a pool with each instance type. Also, consider using spot instances to reduce costs and on-demand instances for jobs with short execution times and strict execution time requirements.
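As an illustration of "one pool per instance type, and separate pools for driver and workers", a cluster definition could look up its pools by workload. The pool IDs, workload names, and the `cluster_spec` helper are hypothetical; `instance_pool_id` and `driver_instance_pool_id` are the cluster fields I would expect to use here, but verify them against the Clusters API reference.

```python
# Hypothetical mapping from workload to pool IDs; replace with your own pools.
POOLS = {
    "data_engineering": {"worker": "pool-de-workers", "driver": "pool-de-driver"},
    "data_science": {"worker": "pool-ds-workers", "driver": "pool-ds-driver"},
    "analytics": {"worker": "pool-an-workers", "driver": "pool-an-driver"},
}


def cluster_spec(workload: str, spark_version: str, num_workers: int) -> dict:
    """Build a cluster definition that draws driver and workers from separate pools."""
    pools = POOLS[workload]
    return {
        "cluster_name": f"{workload}-cluster",
        "spark_version": spark_version,
        "num_workers": num_workers,
        # The node type comes from the pool, so no node_type_id is set here.
        "instance_pool_id": pools["worker"],
        "driver_instance_pool_id": pools["driver"],
    }


if __name__ == "__main__":
    print(cluster_spec("data_engineering", "<runtime-version>", num_workers=4))
```

Keeping the mapping in one place makes it easy to see which instance types the organization actually standardizes on.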

 

https://www.databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html

