06-08-2024 08:20 AM
Databricks is a popular unified data analytics platform known for its powerful data processing capabilities and seamless integration with Apache Spark. However, managing and optimizing costs in Databricks can be challenging, especially when it comes to choosing the right cluster size for different workloads. This article explores how to dynamically select cluster sizes to save costs by leveraging Databricks cluster pools and analyzing logs from previous runs.
Understanding Databricks Cluster Pools
Databricks cluster pools are a way to manage and provision clusters more efficiently. A cluster pool reduces cluster start and auto-scaling times by maintaining a set of ready-to-use instances. When a new cluster is requested, it can be created quickly from the pool, minimizing the time and cost associated with cluster initialization.
Key benefits of using cluster pools include:
- Faster cluster startup and auto-scaling, because idle instances are already provisioned and waiting in the pool.
- Lower cost for idle capacity: instances sitting idle in a pool do not accrue DBU charges (cloud provider instance charges still apply).
- Consistent instance configuration for every cluster that attaches to the pool, which simplifies management.
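As a rough sketch, a pool can also be created programmatically through the Databricks Instance Pools REST API. The workspace URL, token, and pool settings below are placeholders, not a recommended configuration.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"  # placeholder token

# Create a warm pool of r5.2xlarge instances that clusters can draw from
payload = {
    "instance_pool_name": "medium-jobs-pool",
    "node_type_id": "r5.2xlarge",
    "min_idle_instances": 2,
    "max_capacity": 8,
    "idle_instance_autotermination_minutes": 30
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload
)
resp.raise_for_status()
print(resp.json())  # the response contains the new instance_pool_id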
Dynamically Choosing Cluster Sizes
To optimize costs further, you can dynamically select cluster sizes based on historical data from previous runs. This involves analyzing logs to determine the amount of data processed and then using this information to choose the appropriate cluster size from predefined pools.
Steps to Implement Dynamic Cluster Sizing
1. Collect and analyze logs from previous runs.
2. Define cluster pools for different workload sizes.
3. Set thresholds (flags) based on the data analysis.
4. Dynamically allocate a cluster pool for each run.
Example Implementation
Step 1: Collect and Analyze Logs
First, collect logs that contain information about the data processed in previous runs. These can include the number of records processed, the size of the data, and the time taken.
import pandas as pd

# Sample log data from previous runs
logs = pd.DataFrame({
    'run_id': [1, 2, 3, 4, 5],
    'data_size_gb': [10, 25, 50, 5, 35],
    'record_count': [100000, 250000, 500000, 50000, 350000],
    'processing_time_min': [30, 70, 120, 20, 90]
})

# Analyzing the logs
average_data_size = logs['data_size_gb'].mean()
average_record_count = logs['record_count'].mean()
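Averages can understate occasional large runs, so one variation (a sketch on the sample logs above; the 90th-percentile choice is an assumption, not part of the original approach) is to size against a high percentile of historical data volume:
# Size against the 90th percentile of historical data volume rather than the mean,
# so occasional large runs still get enough resources (the 0.9 value is illustrative)
p90_data_size = logs['data_size_gb'].quantile(0.9)
print(f"Average data size: {average_data_size:.1f} GB, p90 data size: {p90_data_size:.1f} GB")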
Step 2: Define Cluster Pools
Create different cluster pools based on anticipated workload sizes.
# Example cluster pool definitions with configurations
cluster_pools = {
    'small': {
        'min_workers': 2,
        'max_workers': 4,
        'node_type_id': 'i3.xlarge',
        'driver_node_type_id': 'i3.xlarge',
        'spark_conf': {
            # illustrative executor settings sized for i3.xlarge (4 vCPU) workers
            'spark.executor.memory': '8g',
            'spark.executor.cores': '2'
        },
        'autotermination_minutes': 20,
        'enable_elastic_disk': True
    },
    'medium': {
        'min_workers': 4,
        'max_workers': 8,
        'node_type_id': 'r5.2xlarge',
        'driver_node_type_id': 'r5.2xlarge',
        'spark_conf': {
            'spark.executor.memory': '16g',
            'spark.executor.cores': '4'
        },
        'autotermination_minutes': 30,
        'enable_elastic_disk': True
    },
    'large': {
        'min_workers': 8,
        'max_workers': 16,
        'node_type_id': 'r5.4xlarge',
        'driver_node_type_id': 'r5.4xlarge',
        'spark_conf': {
            'spark.executor.memory': '32g',
            'spark.executor.cores': '8'
        },
        'autotermination_minutes': 60,
        'enable_elastic_disk': True
    }
}
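If each of these sizes maps to a real instance pool in the workspace, the pool IDs can be resolved by name with the Instance Pools API. The pool names below are hypothetical, and the snippet reuses the placeholder DATABRICKS_HOST and TOKEN from the earlier sketch.
import requests

# Resolve instance_pool_id for each size by pool name (names are hypothetical)
pool_names = {'small': 'small-pool', 'medium': 'medium-pool', 'large': 'large-pool'}

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/list",
    headers={"Authorization": f"Bearer {TOKEN}"}
)
resp.raise_for_status()
pools_by_name = {p['instance_pool_name']: p['instance_pool_id']
                 for p in resp.json().get('instance_pools', [])}

pool_ids = {size: pools_by_name[name] for size, name in pool_names.items()}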
Step 3: Set Flags Based on Data Analysis
Determine thresholds for choosing different cluster sizes based on historical data.
# Setting thresholds
small_threshold = 20 # GB
medium_threshold = 40 # GB
def determine_cluster_size(data_size_gb):
    if data_size_gb <= small_threshold:
        return 'small'
    elif data_size_gb <= medium_threshold:
        return 'medium'
    else:
        return 'large'
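A quick check of the thresholds with the function above:
# Exercise each branch of determine_cluster_size
for size_gb in (10, 30, 55):
    print(size_gb, 'GB ->', determine_cluster_size(size_gb))
# 10 GB -> small, 30 GB -> medium, 55 GB -> large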
Step 4: Dynamic Cluster Allocation
Use the flag to choose the cluster pool dynamically for each run; the selected configuration can then be handed to the Databricks APIs, as sketched after the snippet below.
# Example data size for the current run
current_data_size = 30 # GB
# Determine cluster size based on the current data size
selected_cluster_size = determine_cluster_size(current_data_size)
selected_pool = cluster_pools[selected_cluster_size]
# Print the selected pool configuration
print(f"Selected Cluster Pool: {selected_cluster_size}")
print(f"Cluster Configuration: {selected_pool}")
To quantify the benefits of dynamically choosing cluster sizes in Databricks, it's essential to conduct a benchmarking exercise. This involves comparing the costs and performance metrics before and after implementing dynamic cluster sizing. Let's assume we have collected data from multiple runs of a typical workload over a month.
In the static cluster allocation scenario, we use a predefined large cluster for all workloads, regardless of their size. The cluster configuration is as follows:
- 16 nodes at 2.00 USD per node per hour (32.00 USD per hour in total)
In the dynamic cluster allocation scenario, clusters are chosen based on the size of the workload. Let's assume the cluster configurations and their associated costs are as follows:
- Small: 3 nodes at 0.50 USD per node per hour
- Medium: 6 nodes at 1.00 USD per node per hour
- Large: 12 nodes at 2.00 USD per node per hour
Based on our log analysis, we categorize the 100 monthly runs (each about one hour) as follows:
- Small workloads: 40 runs
- Medium workloads: 40 runs
- Large workloads: 20 runs
Before Implementation
- Hourly cost per node = 2.00 USD
- Total hourly cost = 16 nodes × 2.00 USD = 32.00 USD
- Monthly cost = 100 runs × 1 hour per run × 32.00 USD = 3,200 USD
- Startup time cost per run = 5 minutes = 5/60 hour × 32.00 USD ≈ 2.67 USD
- Total startup cost per month = 100 runs × 2.67 USD = 267 USD
- Total monthly cost = 3,200 USD + 267 USD = 3,467 USD
After Implementation
Small workloads (40 runs):
- Hourly cost per node = 0.50 USD
- Total hourly cost = 3 nodes × 0.50 USD = 1.50 USD
- Monthly cost for small workloads = 40 runs × 1 hour per run × 1.50 USD = 60 USD
Medium workloads (40 runs):
- Hourly cost per node = 1.00 USD
- Total hourly cost = 6 nodes × 1.00 USD = 6.00 USD
- Monthly cost for medium workloads = 40 runs × 1 hour per run × 6.00 USD = 240 USD
Large workloads (20 runs):
- Hourly cost per node = 2.00 USD
- Total hourly cost = 12 nodes × 2.00 USD = 24.00 USD
- Monthly cost for large workloads = 20 runs × 1 hour per run × 24.00 USD = 480 USD
Total monthly cluster cost = 60 USD + 240 USD + 480 USD = 780 USD
Assuming the dynamic allocation reduces the average startup time to 2 minutes:
- Startup time cost per run = 2 minutes = 2/60 hour × average hourly cluster cost
- Average hourly cluster cost = (1.50 USD + 6.00 USD + 24.00 USD) / 3 = 10.50 USD
- Startup time cost per run = 2/60 × 10.50 USD = 0.35 USD
- Total startup cost per month = 100 runs × 0.35 USD = 35 USD
- Total monthly cost = 780 USD + 35 USD = 815 USD
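Putting the two scenarios side by side, this example works out to a saving of roughly 2,652 USD per month, or about 76%. A quick calculation using the figures above:
# Compare the static and dynamic scenarios using the figures above
static_monthly = 3200 + 267   # cluster cost + startup cost (USD)
dynamic_monthly = 780 + 35    # cluster cost + startup cost (USD)
savings = static_monthly - dynamic_monthly
print(f"Monthly savings: {savings} USD ({savings / static_monthly:.0%})")
# Monthly savings: 2652 USD (76%)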
Implementing dynamic cluster sizing in Databricks can lead to significant cost savings and performance improvements. By leveraging historical data and cluster pools, organizations can match each workload with appropriate resources, optimizing costs while maintaining efficiency. This approach not only reduces spend but also promotes sustainable and efficient resource utilization.
06-25-2024 03:09 AM
@Harun This is amazing! Thank you for sharing
03-12-2025 08:21 AM
How can this actually be used to choose a cluster pool for a Databricks workflow dynamically, that is, at run time? In other words, what can you actually do with the value of `selected_pool` other than printing it out?