Databricks Community

ksenija · ‎04-11-2024

Could you help me understand pools?

How can I work out the difference in pricing between running clusters on their own and running clusters with a pool, given that a pool saves us time when starting and stopping clusters?

And should we keep Min Idle above 0 or equal to 0?

Also, what's your best practice: do you use pools all the time or only for some reporting/urgent purposes?

Walter_C · ‎04-13-2024

Databricks pools are a set of idle, ready-to-use instances. When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Databricks does not charge DBUs while instances are idle in the pool, resulting in cost savings. However, cloud provider infrastructure costs do apply.
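To make the pricing difference concrete, here is a minimal sketch of the billing rule described above: the cloud provider charges for every running instance, while DBUs are charged only for instances attached to a running cluster. The function name and all rates are hypothetical placeholders, not real prices.

```python
def hourly_pool_cost(in_use, idle, vm_rate, dbu_per_instance, dbu_rate):
    """Rough hourly cost of a pool; every rate here is a made-up placeholder."""
    if in_use < 0 or idle < 0:
        raise ValueError("instance counts must be non-negative")
    # The cloud provider bills every running VM, whether idle or in use.
    vm_cost = (in_use + idle) * vm_rate
    # DBUs accrue only on instances that a cluster is actually using.
    dbu_cost = in_use * dbu_per_instance * dbu_rate
    return vm_cost + dbu_cost


# Hypothetical rates: $0.50 per VM-hour, 1.0 DBU per instance-hour, $0.20 per DBU.
busy_hour = hourly_pool_cost(in_use=4, idle=0, vm_rate=0.50,
                             dbu_per_instance=1.0, dbu_rate=0.20)
idle_hour = hourly_pool_cost(in_use=0, idle=2, vm_rate=0.50,
                             dbu_per_instance=1.0, dbu_rate=0.20)
print(f"busy hour: ${busy_hour:.2f}, idle hour with Min Idle = 2: ${idle_hour:.2f}")
```

With your own rates plugged in, the idle-hour figure is what a non-zero Min Idle costs you for each hour the pool sits unused.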

For the Min Idle setting, the usual recommendation is to set Min Idle to 0 so that you don't pay for running instances that aren't doing work. The trade-off is that a cluster may take longer to start when the pool has to acquire a new instance. If you only run interactive workloads during business hours, make sure the pool's "Min Idle" instance count is set to zero after hours. If your automated data pipeline runs for a few hours at night, raise the "Min Idle" count a few minutes before the pipeline starts and revert it to zero afterwards.
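The raise-then-revert pattern could be scripted along these lines. This is only a sketch: it assumes the Instance Pools REST endpoint `POST /api/2.0/instance-pools/edit` with the `min_idle_instances` field, and the helper names, host, token, and IDs are hypothetical. Check the API reference for your workspace before relying on it, in particular whether optional fields you leave out (such as max capacity) are reset by an edit.

```python
import json
import urllib.request


def build_pool_edit_payload(pool_id, pool_name, node_type_id, min_idle):
    """Request body for an instance pool edit (field names assumed from the REST API)."""
    if min_idle < 0:
        raise ValueError("min_idle must be >= 0")
    return {
        "instance_pool_id": pool_id,
        "instance_pool_name": pool_name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
    }


def set_min_idle(host, token, payload):
    """Send the edit request; host and token are supplied by the caller."""
    request = urllib.request.Request(
        f"{host}/api/2.0/instance-pools/edit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Placeholder values: warm the pool before the nightly run, drain it afterwards.
warm_up = build_pool_edit_payload("pool-123", "etl-pool", "node-type-a", min_idle=4)
drain = build_pool_edit_payload("pool-123", "etl-pool", "node-type-a", min_idle=0)
```

You would call `set_min_idle` with `warm_up` from a scheduler a few minutes before the pipeline starts, and again with `drain` once it has finished.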

As for best practice, it depends on your specific use case. If your driver node and worker nodes have different requirements, create a separate pool for each. You can minimize instance acquisition time by creating a pool for each instance type and Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create one pool per instance type. Also, consider using spot instances to reduce costs, and on-demand instances for jobs with short execution times and strict execution time requirements.
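As an illustration of separate driver and worker pools, here is a sketch of a cluster specification that points each role at its own pool. It assumes the Clusters API fields `instance_pool_id` (workers) and `driver_instance_pool_id` (driver); the function name, pool IDs, and runtime version string are placeholders.

```python
def pooled_cluster_spec(name, spark_version, worker_pool_id, driver_pool_id, num_workers):
    """Cluster spec that takes workers and the driver from different pools."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        # Workers come from this pool; the pool determines the instance type.
        "instance_pool_id": worker_pool_id,
        # The driver comes from a separate pool with its own instance type.
        "driver_instance_pool_id": driver_pool_id,
        "num_workers": num_workers,
    }


# Placeholder IDs and runtime version, for illustration only.
spec = pooled_cluster_spec(
    name="nightly-etl",
    spark_version="13.3.x-scala2.12",
    worker_pool_id="pool-workers-type-a",
    driver_pool_id="pool-driver-type-b",
    num_workers=8,
)
```

Because each pool fixes the instance type, the spec carries no node type of its own, which is why a pool per commonly used instance type keeps cluster definitions simple.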

https://www.databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
