Optimizing Costs in Databricks by Dynamically Choosing Cluster Sizes

Harun — Sat, 08 Jun 2024 15:20:11 GMT

Databricks is a popular unified data analytics platform known for its powerful data processing capabilities and seamless integration with Apache Spark. However, managing and optimizing costs in Databricks can be challenging, especially when it comes to choosing the right cluster size for different workloads. This article explores how to dynamically select cluster sizes to save costs, leveraging Databricks cluster pools and analyzing logs from previous runs.

Understanding Databricks Cluster Pools

Databricks cluster pools are a way to manage and provision clusters more efficiently. A cluster pool reduces cluster start and auto-scaling times by maintaining a set of ready-to-use instances. When a new cluster is requested, it can be created quickly from the pool, minimizing the time and cost associated with cluster initialization.

Key benefits of using cluster pools include:

Reduced startup time: Pre-configured instances are available to be quickly allocated to clusters.
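Cost savings: By capping the number of idle and total instances in a pool, you control how much standby capacity you pay for. Databricks does not charge DBUs for instances sitting idle in a pool, although the cloud provider still bills for them.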
Consistency: Pools ensure that clusters are created with consistent configurations, reducing variability and potential issues.
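
As a rough illustration of what such pools could look like, the sketch below builds one definition per size using the field names of the Databricks Instance Pools API. The pool names, idle counts, capacities and idle timeout are assumptions chosen for the example (capacity is set to the largest worker count plus one instance for the driver); the node types are the ones used later in this article.

```python
import json


def pool_definition(name, node_type_id, min_idle, max_capacity, idle_minutes=15):
    """Build a request body shaped like an Instance Pools API create call."""
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        # Instances kept warm so that clusters can start quickly
        "min_idle_instances": min_idle,
        # Upper bound on idle plus in-use instances in the pool
        "max_capacity": max_capacity,
        # Idle instances above min_idle are released after this many minutes
        "idle_instance_autotermination_minutes": idle_minutes,
    }


# Hypothetical pools: all names and numbers are illustrative assumptions
pools = {
    "small": pool_definition("small-pool", "i3.xlarge", min_idle=1, max_capacity=5),
    "medium": pool_definition("medium-pool", "r5.2xlarge", min_idle=1, max_capacity=9),
    "large": pool_definition("large-pool", "r5.4xlarge", min_idle=0, max_capacity=17),
}

print(json.dumps(pools["small"], indent=2))
```

Keeping min_idle_instances low for the large pool is one way to avoid paying the cloud provider for expensive instances that rarely get used.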

Dynamically Choosing Cluster Sizes

To optimize costs further, you can dynamically select cluster sizes based on historical data from previous runs. This involves analyzing logs to determine how much data each run processed and then using that information to choose the appropriate cluster size from a set of predefined pools.

Steps to Implement Dynamic Cluster Sizing

Log Analysis: Collect and analyze logs from previous runs to understand the data volume and processing requirements.
Define Cluster Pools: Create different cluster pools based on workload requirements (e.g., small, medium, large).
Set Flags Based on Data Analysis: Use historical data to set flags that determine the cluster size needed for future runs.
Dynamic Cluster Allocation: Implement logic to dynamically select and allocate clusters from the appropriate pool based on the flags.

Example Implementation

Let's walk through an example implementation of dynamically choosing cluster sizes based on previous run data.

Step 1: Log Analysis

First, collect logs that contain information about the data processed in previous runs. This can include the number of records processed, the size of the data, and the time taken.

    import pandas as pd

    # Sample log data from five previous runs
    logs = pd.DataFrame({
        'run_id': [1, 2, 3, 4, 5],
        'data_size_gb': [10, 25, 50, 5, 35],
        'record_count': [100000, 250000, 500000, 50000, 350000],
        'processing_time_min': [30, 70, 120, 20, 90]
    })

    # Analyzing the logs
    average_data_size = logs['data_size_gb'].mean()
    average_record_count = logs['record_count'].mean()
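
Since the logs also record processing time, one further check is whether throughput stays roughly constant as data size grows. The snippet below is a minimal sketch of that idea using the same sample data; the column name gb_per_min is an assumption introduced here and is not part of the original analysis.

```python
import pandas as pd

# Same sample log data as in Step 1
logs = pd.DataFrame({
    'run_id': [1, 2, 3, 4, 5],
    'data_size_gb': [10, 25, 50, 5, 35],
    'record_count': [100000, 250000, 500000, 50000, 350000],
    'processing_time_min': [30, 70, 120, 20, 90]
})

# Throughput per run: gigabytes processed per minute (hypothetical column)
logs['gb_per_min'] = logs['data_size_gb'] / logs['processing_time_min']

# Sort by data size to see how throughput changes with volume
summary = logs.sort_values('data_size_gb')[
    ['run_id', 'data_size_gb', 'processing_time_min', 'gb_per_min']
]
print(summary.round(3))
```

If throughput drops sharply for the largest runs, that is a hint that those runs are under-provisioned and belong in a bigger size class.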

Step 2: Define Cluster Pools

Define a cluster configuration for each anticipated workload size. The dictionary below holds sizing parameters only; a cluster that draws its instances from a real Databricks pool would reference that pool through instance_pool_id.

    # Example cluster configurations, one per pool size
    cluster_pools = {
        'small': {
            'min_workers': 2,
            'max_workers': 4,
            'node_type_id': 'i3.xlarge',
            'driver_node_type_id': 'i3.xlarge',
            # Spark defaults; a single-node profile would conflict with 2-4 workers
            'spark_conf': {},
            'autotermination_minutes': 20,
            'enable_elastic_disk': True
        },
        'medium': {
            'min_workers': 4,
            'max_workers': 8,
            'node_type_id': 'r5.2xlarge',
            'driver_node_type_id': 'r5.2xlarge',
            'spark_conf': {
                'spark.executor.memory': '16g',
                'spark.executor.cores': '4'
            },
            'autotermination_minutes': 30,
            'enable_elastic_disk': True
        },
        'large': {
            'min_workers': 8,
            'max_workers': 16,
            'node_type_id': 'r5.4xlarge',
            'driver_node_type_id': 'r5.4xlarge',
            'spark_conf': {
                'spark.executor.memory': '32g',
                'spark.executor.cores': '8'
            },
            'autotermination_minutes': 60,
            'enable_elastic_disk': True
        }
    }

Step 3: Set Flags Based on Data Analysis

Determine thresholds for choosing different cluster sizes based on historical data.

    # Setting thresholds
    small_threshold = 20   # GB
    medium_threshold = 40  # GB

    def determine_cluster_size(data_size_gb):
        if data_size_gb <= small_threshold:
            return 'small'
        elif data_size_gb <= medium_threshold:
            return 'medium'
        else:
            return 'large'
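
The thresholds above are hard-coded. One possible way to derive them from the historical data instead is to take quantiles of the observed data sizes. The sketch below assumes a target split of roughly 40% small, 40% medium and 20% large runs; that split is an assumption made for illustration, and the sizes are the sample values from Step 1.

```python
import pandas as pd

# Data sizes (GB) from the sample logs in Step 1
historical_sizes_gb = pd.Series([10, 25, 50, 5, 35])

# Assumed target split: the smallest 40% of runs are "small", the next 40% "medium"
small_threshold = historical_sizes_gb.quantile(0.40)
medium_threshold = historical_sizes_gb.quantile(0.80)


def determine_cluster_size(data_size_gb):
    """Classify a run by comparing its data size with the derived thresholds."""
    if data_size_gb <= small_threshold:
        return 'small'
    elif data_size_gb <= medium_threshold:
        return 'medium'
    else:
        return 'large'


print(f"small <= {small_threshold:.1f} GB, medium <= {medium_threshold:.1f} GB")
print(determine_cluster_size(30))
```

With only five sample runs the quantiles are coarse; with a longer history the same two lines would track the workload as it drifts.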

Step 4: Dynamic Cluster Allocation

Use the flag to dynamically choose the cluster pool.

    # Example data size for the current run
    current_data_size = 30  # GB

    # Determine cluster size based on the current data size
    selected_cluster_size = determine_cluster_size(current_data_size)
    selected_pool = cluster_pools[selected_cluster_size]

    # Print the selected pool configuration
    print(f"Selected Cluster Pool: {selected_cluster_size}")
    print(f"Cluster Configuration: {selected_pool}")
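
To go one step beyond printing, the selected configuration can be mapped onto a request body for cluster creation. The following is a minimal sketch, assuming the field names of the Databricks Clusters API create call; the helper to_cluster_spec, the cluster name prefix and the spark_version value are hypothetical choices for this example, and nothing is sent over the network.

```python
import json


def to_cluster_spec(name, config, spark_version="14.3.x-scala2.12"):
    """Map one sizing entry to a body shaped like a Clusters API create request.

    spark_version is an assumed runtime string; use one available in your workspace.
    """
    return {
        "cluster_name": f"dynamic-{name}",  # hypothetical naming scheme
        "spark_version": spark_version,
        "node_type_id": config["node_type_id"],
        "driver_node_type_id": config["driver_node_type_id"],
        # The API expresses the worker range as an autoscale object
        "autoscale": {
            "min_workers": config["min_workers"],
            "max_workers": config["max_workers"],
        },
        "spark_conf": config.get("spark_conf", {}),
        "autotermination_minutes": config["autotermination_minutes"],
        "enable_elastic_disk": config["enable_elastic_disk"],
    }


# The 'medium' entry from Step 2, repeated so the snippet runs on its own
medium = {
    "min_workers": 4,
    "max_workers": 8,
    "node_type_id": "r5.2xlarge",
    "driver_node_type_id": "r5.2xlarge",
    "spark_conf": {"spark.executor.memory": "16g", "spark.executor.cores": "4"},
    "autotermination_minutes": 30,
    "enable_elastic_disk": True,
}

spec = to_cluster_spec("medium", medium)
print(json.dumps(spec, indent=2))
```

The resulting dictionary could then be passed to whatever client you use for the Clusters API, or embedded as the new_cluster section of a job definition.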

Benchmarking: Cost Savings from Dynamic Cluster Sizing in Databricks

To quantify the benefits of dynamically choosing cluster sizes in Databricks, it's essential to conduct a benchmarking exercise. This involves comparing the costs and performance metrics before and after implementing dynamic cluster sizing. Let's assume we have collected data from multiple runs of a typical workload over a month.

Before Implementation: Static Cluster Allocation

In the static cluster allocation scenario, we use a predefined large cluster for all workloads, regardless of their size. The cluster configuration is as follows:

Cluster Configuration: 16 nodes (r5.4xlarge)
Average monthly cost: about $3,467 at the assumed rate of $2.00 per node per hour (derived in the detailed cost analysis below)
Average startup time: 5 minutes
Average processing time per run: 1 hour
Total number of runs per month: 100

After Implementation: Dynamic Cluster Allocation

In the dynamic cluster allocation scenario, clusters are chosen based on the size of the workload. Let's assume the cluster configurations and their associated costs are as follows:

Small Cluster (2-4 nodes, i3.xlarge): $0.50 per hour per node
Medium Cluster (4-8 nodes, r5.2xlarge): $1.00 per hour per node
Large Cluster (8-16 nodes, r5.4xlarge): $2.00 per hour per node

Based on our log analysis, we categorize the runs as follows:

Small workloads: 40 runs per month
Medium workloads: 40 runs per month
Large workloads: 20 runs per month

Detailed Cost Analysis

Before Implementation

Cluster Cost Calculation:

Hourly cost per node = 2.00 USD

Total hourly cost = 16 nodes × 2.00 USD = 32 USD

Monthly cost = 100 runs × 1 hour per run × 32 USD = 3,200 USD

Startup Cost Calculation:

Startup cost per run = (5/60) hour × 32 USD per hour ≈ 2.67 USD

Total startup cost per month = 100 runs × 2.67 USD = 267 USD

Total Monthly Cost Before:

Total monthly cost = 3,200 USD + 267 USD = 3,467 USD
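
The same arithmetic can be reproduced in a few lines, which makes it easy to rerun with your own rates. This is a sketch using only the assumptions stated above (16 nodes, 2.00 USD per node per hour, 100 one-hour runs, 5 minutes of startup per run); the variable names are illustrative.

```python
# Static scenario inputs, taken from the assumptions above
NODES = 16
RATE_PER_NODE_HOUR = 2.00  # USD
RUNS_PER_MONTH = 100
HOURS_PER_RUN = 1.0
STARTUP_MINUTES = 5

hourly_cost = NODES * RATE_PER_NODE_HOUR

# Cost of the time spent processing data
processing_cost = RUNS_PER_MONTH * HOURS_PER_RUN * hourly_cost

# Cost of the time spent waiting for the cluster to start
startup_cost = RUNS_PER_MONTH * (STARTUP_MINUTES / 60) * hourly_cost

total_before = processing_cost + startup_cost

print(f"Processing: {processing_cost:,.0f} USD")
print(f"Startup:    {startup_cost:,.0f} USD")
print(f"Total:      {total_before:,.0f} USD")
```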

After Implementation

Cluster Cost Calculation for Each Category (assuming each cluster runs at the midpoint of its autoscaling range):

Small workloads:

Hourly cost per node = 0.50 USD

Total hourly cost = 3 nodes × 0.50 USD = 1.50 USD

Monthly cost for small workloads = 40 runs × 1 hour per run × 1.50 USD = 60 USD

Medium workloads:

Hourly cost per node = 1.00 USD

Total hourly cost = 6 nodes × 1.00 USD = 6.00 USD

Monthly cost for medium workloads = 40 runs × 1 hour per run × 6.00 USD = 240 USD

Large workloads:

Hourly cost per node = 2.00 USD

Total hourly cost = 12 nodes × 2.00 USD = 24 USD

Monthly cost for large workloads = 20 runs × 1 hour per run × 24 USD = 480 USD

Total monthly cost for clusters = 60 USD + 240 USD + 480 USD = 780 USD
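
The per-category cluster costs can be checked with a short script. This is a sketch that uses the node counts, rates and run counts assumed above; the dictionary layout is an illustrative choice.

```python
HOURS_PER_RUN = 1.0

# Node count, hourly rate per node (USD) and monthly runs for each category
categories = {
    "small": {"nodes": 3, "rate": 0.50, "runs": 40},
    "medium": {"nodes": 6, "rate": 1.00, "runs": 40},
    "large": {"nodes": 12, "rate": 2.00, "runs": 20},
}

# Monthly cost per category = runs x hours per run x nodes x rate
costs = {
    name: c["runs"] * HOURS_PER_RUN * c["nodes"] * c["rate"]
    for name, c in categories.items()
}
total_cluster_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:>6}: {cost:,.0f} USD")
print(f" total: {total_cluster_cost:,.0f} USD")
```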

Startup Cost Calculation:

Assuming the dynamic allocation reduces the average startup time to 2 minutes:

Startup cost per run = (2/60) hour × average hourly cost of cluster

Because the three categories have different run counts, the average hourly cost is weighted by the number of runs:

Average hourly cost = (40 × 1.50 USD + 40 × 6.00 USD + 20 × 24.00 USD) / 100 = 7.80 USD

Startup cost per run = (2/60) × 7.80 USD = 0.26 USD

Total startup cost per month = 100 runs × 0.26 USD = 26 USD

Total Monthly Cost After:

Total monthly cost = 780 USD + 26 USD = 806 USD

Summary of Cost Savings

Total monthly cost before dynamic sizing: $3,467
Total monthly cost after dynamic sizing: $806
Monthly cost savings: $3,467 - $806 = $2,661 (about 77%)

Additional Benefits

Performance Improvements

Startup time reduction: From 5 minutes to 2 minutes, reducing waiting time by 60%.
Improved resource utilization: By right-sizing clusters, the resources are better utilized, avoiding over-provisioning.

Environmental Impact

Reduced energy consumption: Using smaller clusters for smaller workloads decreases the overall energy usage, contributing to a greener environment.

Conclusion

Implementing dynamic cluster sizing in Databricks can lead to significant cost savings and performance improvements. By leveraging historical data and cluster pools, organizations can ensure that each workload is matched with the appropriate resources, leading to optimized costs and enhanced efficiency. This approach not only saves cost but also promotes sustainable and efficient resource utilization.

Re: Optimizing Costs in Databricks by Dynamically Choosing Cluster Sizes

Sujitha — Tue, 25 Jun 2024 10:09:11 GMT

@Harun This is amazing! Thank you for sharing

Re: Optimizing Costs in Databricks by Dynamically Choosing Cluster Sizes

kmacgregor — Wed, 12 Mar 2025 15:21:44 GMT

How can this actually be used to choose a cluster pool for a Databricks workflow dynamically, that is, at run time? In other words, what can you actually do with the value of `selected_pool` other than printing it out?

Re: Optimizing Costs in Databricks by Dynamically Choosing Cluster Sizes

allyallen — Tue, 15 Jul 2025 12:30:07 GMT

Hi @kmacgregor
Did you ever find out how to dynamically choose a cluster pool for a workflow?
I've raised a similar question on another thread but wondered if you'd figured out a solution.

https://community.databricks.com/t5/data-engineering/variable-compute-clusters-within-a-job/m-p/124925#M47292

Thanks in advance!

Re: Optimizing Costs in Databricks by Dynamically Choosing Cluster Sizes

mame17 — Tue, 30 Dec 2025 04:55:40 GMT

@kmacgregor You’re right: just printing out selected_pool isn’t enough to actually use dynamic cluster sizing at runtime. In practice, the value of selected_pool would feed directly into the Databricks cluster creation API or your workflow automation. For example, if you’re using Databricks Jobs or a workflow orchestrator like Airflow, you can programmatically pass the configuration from selected_pool to create or start the appropriate cluster just before your workload runs. This way, each job is sized automatically without manual intervention.
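
A minimal sketch of that idea, assuming a one-time run submitted through the Jobs API (POST /api/2.1/jobs/runs/submit) and instance pools that already exist: the pool IDs, notebook path, task key and spark_version below are placeholders or assumptions, and the snippet only builds the request body rather than sending it.

```python
import json

# Hypothetical pool IDs: replace with the IDs of pools in your workspace
POOL_IDS = {
    "small": "<small-pool-id>",
    "medium": "<medium-pool-id>",
    "large": "<large-pool-id>",
}

# Worker ranges from the article's small / medium / large definitions
WORKER_RANGES = {"small": (2, 4), "medium": (4, 8), "large": (8, 16)}


def build_run_submit_body(size, notebook_path, spark_version="14.3.x-scala2.12"):
    """Build a runs/submit body whose job cluster draws from the chosen pool."""
    min_workers, max_workers = WORKER_RANGES[size]
    return {
        "run_name": f"dynamic-sizing-{size}",
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": spark_version,  # assumed runtime string
                    # The node type comes from the pool, so it is not set here
                    "instance_pool_id": POOL_IDS[size],
                    "autoscale": {
                        "min_workers": min_workers,
                        "max_workers": max_workers,
                    },
                },
            }
        ],
    }


# "medium" stands in for the value returned by determine_cluster_size
body = build_run_submit_body("medium", "/Workspace/Shared/etl_notebook")
print(json.dumps(body, indent=2))
```

A small driver script or orchestrator task would run the sizing logic first, build this body, and submit it with an authenticated HTTP client, so the pool is chosen at run time rather than fixed in the job definition.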

A practical tip: combine this with a logging system that tracks the performance of each run. Over time, you can refine the thresholds for small, medium, and large clusters, keeping costs optimized while avoiding under-provisioning. Some teams also add cost monitoring dashboards so they can quickly see which jobs consume the most resources and adjust cluster policies accordingly.

This approach keeps your workloads efficient, reduces wait times, and ensures you only pay for what you actually need.
