Databricks is a popular unified data analytics platform known for its powerful data processing capabilities and seamless integration with Apache Spark. However, managing and optimizing costs in Databricks can be challenging, especially when it comes to choosing the right cluster size for different workloads. This article explores how to dynamically select cluster sizes to save costs by leveraging Databricks cluster pools and analyzing logs from previous runs.
Understanding Databricks Cluster Pools
Databricks cluster pools are a way to manage and provision clusters more efficiently. A cluster pool reduces cluster start and auto-scaling times by maintaining a set of ready-to-use instances. When a new cluster is requested, it can be created quickly from the pool, minimizing the time and cost associated with cluster initialization.
Key benefits of using cluster pools include:
- Reduced startup time: Pre-configured instances are available to be quickly allocated to clusters.
- Cost savings: By managing the number of instances in a pool, you can control the costs more effectively.
- Consistency: Pools ensure that clusters are created with consistent configurations, reducing variability and potential issues.
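Creating such a pool is a one-time setup step. Below is a minimal sketch of a request body for the Databricks Instance Pools API (POST /api/2.0/instance-pools/create); the field names follow the public API, while the pool name, node type, and capacity figures are illustrative assumptions:

```python
import json

# Illustrative warm-pool definition; field names follow the Databricks
# Instance Pools API, the values are assumptions for this example.
pool_payload = {
    "instance_pool_name": "small-warm-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,   # instances kept warm so clusters start quickly
    "max_capacity": 10,        # hard cap on pool size to bound cost
    "idle_instance_autotermination_minutes": 15,
}

print(json.dumps(pool_payload, indent=2))

# The request itself would be sent with an authenticated HTTP client, e.g.:
# requests.post(f"{host}/api/2.0/instance-pools/create",
#               headers={"Authorization": f"Bearer {token}"},
#               json=pool_payload)
```

Because min_idle_instances keeps instances running even when no cluster is using them, the idle count is itself a cost knob: higher values mean faster startups but a larger standing bill.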
Dynamically Choosing Cluster Sizes
To optimize costs further, you can dynamically select cluster sizes based on historical data from previous runs. This involves analyzing logs to determine how much data was processed, then using that information to choose an appropriate cluster size from predefined pools.
Steps to Implement Dynamic Cluster Sizing
- Log Analysis: Collect and analyze logs from previous runs to understand data volumes and processing requirements.
- Define Cluster Pools: Create different cluster pools based on workload requirements (e.g., small, medium, large).
- Set Flags Based on Data Analysis: Use historical data to set flags that determine the cluster size needed for future runs.
- Dynamic Cluster Allocation: Implement logic to dynamically select and allocate clusters from the appropriate pool based on the flags.
Example Implementation
Let's walk through an example implementation of dynamically choosing cluster sizes based on previous run data.
Step 1: Log Analysis
First, collect logs that contain information about the data processed in previous runs. This can include the number of records processed, the size of the data, and the time taken.
```python
import pandas as pd

# Sample log data from previous runs
logs = pd.DataFrame({
    'run_id': [1, 2, 3, 4, 5],
    'data_size_gb': [10, 25, 50, 5, 35],
    'record_count': [100000, 250000, 500000, 50000, 350000],
    'processing_time_min': [30, 70, 120, 20, 90]
})

# Analyzing the logs
average_data_size = logs['data_size_gb'].mean()
average_record_count = logs['record_count'].mean()
```
Step 2: Define Cluster Pools
Create different cluster pools based on anticipated workload sizes.
```python
# Example cluster pool definitions with configurations
cluster_pools = {
    'small': {
        'min_workers': 2,
        'max_workers': 4,
        'node_type_id': 'i3.xlarge',
        'driver_node_type_id': 'i3.xlarge',
        'spark_conf': {
            'spark.executor.memory': '8g',
            'spark.executor.cores': '2'
        },
        'autotermination_minutes': 20,
        'enable_elastic_disk': True
    },
    'medium': {
        'min_workers': 4,
        'max_workers': 8,
        'node_type_id': 'r5.2xlarge',
        'driver_node_type_id': 'r5.2xlarge',
        'spark_conf': {
            'spark.executor.memory': '16g',
            'spark.executor.cores': '4'
        },
        'autotermination_minutes': 30,
        'enable_elastic_disk': True
    },
    'large': {
        'min_workers': 8,
        'max_workers': 16,
        'node_type_id': 'r5.4xlarge',
        'driver_node_type_id': 'r5.4xlarge',
        'spark_conf': {
            'spark.executor.memory': '32g',
            'spark.executor.cores': '8'
        },
        'autotermination_minutes': 60,
        'enable_elastic_disk': True
    }
}
```
Step 3: Set Flags Based on Data Analysis
Determine thresholds for choosing different cluster sizes based on historical data.
```python
# Setting thresholds
small_threshold = 20   # GB
medium_threshold = 40  # GB

def determine_cluster_size(data_size_gb):
    if data_size_gb <= small_threshold:
        return 'small'
    elif data_size_gb <= medium_threshold:
        return 'medium'
    else:
        return 'large'
```
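As a refinement of the hardcoded 20 GB and 40 GB cut-offs, the thresholds can also be derived from the log history itself. The sketch below reuses the sample logs from Step 1 and uses illustrative quantile choices: runs below the median go to the small pool, runs below the 90th percentile to the medium pool, and the rest to the large pool.

```python
import pandas as pd

# Sample log data from Step 1
logs = pd.DataFrame({
    'run_id': [1, 2, 3, 4, 5],
    'data_size_gb': [10, 25, 50, 5, 35],
})

# Derive thresholds from history instead of hardcoding them.
# The quantile choices (median and 90th percentile) are illustrative.
small_threshold = logs['data_size_gb'].quantile(0.5)   # 25.0 for this sample
medium_threshold = logs['data_size_gb'].quantile(0.9)  # ~44.0 for this sample

def determine_cluster_size(data_size_gb):
    if data_size_gb <= small_threshold:
        return 'small'
    elif data_size_gb <= medium_threshold:
        return 'medium'
    return 'large'

print(determine_cluster_size(30))  # 'medium' with these thresholds
```

Re-deriving the thresholds periodically lets the sizing policy track shifts in workload mix without manual retuning.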
Step 4: Dynamic Cluster Allocation
Use the flag to dynamically choose the cluster pool.
```python
# Example data size for the current run
current_data_size = 30  # GB

# Determine cluster size based on the current data size
selected_cluster_size = determine_cluster_size(current_data_size)
selected_pool = cluster_pools[selected_cluster_size]

# Print the selected pool configuration
print(f"Selected Cluster Pool: {selected_cluster_size}")
print(f"Cluster Configuration: {selected_pool}")
```
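With a size selected, the configuration can be turned into a create-cluster request. The sketch below maps each size flag to a hypothetical instance_pool_id (the IDs are placeholders; real ones come from the Instance Pools API or UI) and builds a payload shaped for POST /api/2.0/clusters/create; field names follow the public Clusters API, the values are this example's assumptions.

```python
# Placeholder pool IDs; in practice these come from the pools created earlier.
pool_ids = {
    'small': 'pool-small-0001',
    'medium': 'pool-medium-0001',
    'large': 'pool-large-0001',
}

selected_cluster_size = 'medium'   # e.g. from determine_cluster_size(30)
selected_pool = {                  # subset of the 'medium' config above
    'min_workers': 4,
    'max_workers': 8,
    'autotermination_minutes': 30,
}

# Payload shaped for the Clusters API create endpoint.
cluster_request = {
    'cluster_name': f'dynamic-{selected_cluster_size}',
    'spark_version': '13.3.x-scala2.12',
    'instance_pool_id': pool_ids[selected_cluster_size],
    'autoscale': {
        'min_workers': selected_pool['min_workers'],
        'max_workers': selected_pool['max_workers'],
    },
    'autotermination_minutes': selected_pool['autotermination_minutes'],
}
print(cluster_request)
```

Note that when instance_pool_id is set, the node type is inherited from the pool, which is why node_type_id is omitted from the cluster request.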
Benchmarking: Cost Savings from Dynamic Cluster Sizing in Databricks
To quantify the benefits of dynamically choosing cluster sizes in Databricks, it's essential to conduct a benchmarking exercise. This involves comparing the costs and performance metrics before and after implementing dynamic cluster sizing. Let's assume we have collected data from multiple runs of a typical workload over a month.
Before Implementation: Static Cluster Allocation
In the static cluster allocation scenario, we use a predefined large cluster for all workloads, regardless of their size. The cluster configuration is as follows:
- Cluster configuration: 16 nodes (r5.4xlarge)
- Average startup time: 5 minutes
- Average processing time per run: 1 hour
- Total number of runs per month: 100
- Resulting monthly cost: roughly $3,467 (detailed calculation below)
After Implementation: Dynamic Cluster Allocation
In the dynamic cluster allocation scenario, clusters are chosen based on the size of the workload. Let's assume the cluster configurations and their associated costs are as follows:
- Small Cluster (2-4 nodes, i3.xlarge): $0.50 per hour per node
- Medium Cluster (4-8 nodes, r5.2xlarge): $1.00 per hour per node
- Large Cluster (8-16 nodes, r5.4xlarge): $2.00 per hour per node
Based on our log analysis, we categorize the runs as follows:
- Small workloads: 40 runs per month
- Medium workloads: 40 runs per month
- Large workloads: 20 runs per month
Detailed Cost Analysis
Before Implementation
- Cluster cost:
  - Hourly cost per node: $2.00
  - Total hourly cost: 16 nodes × $2.00 = $32.00
  - Monthly processing cost: 100 runs × 1 hour per run × $32.00 = $3,200
- Startup cost:
  - Startup cost per run: 5 minutes = 5/60 hour × $32.00 ≈ $2.67
  - Total startup cost per month: 100 runs × $2.67 ≈ $267
- Total monthly cost before: $3,200 + $267 = $3,467
After Implementation
- Cluster cost per category (node counts use the midpoint of each pool's autoscaling range):
  - Small: 3 nodes × $0.50 = $1.50/hour; 40 runs × 1 hour × $1.50 = $60
  - Medium: 6 nodes × $1.00 = $6.00/hour; 40 runs × 1 hour × $6.00 = $240
  - Large: 12 nodes × $2.00 = $24.00/hour; 20 runs × 1 hour × $24.00 = $480
  - Total monthly cluster cost: $60 + $240 + $480 = $780
- Startup cost (assuming dynamic allocation reduces the average startup time to 2 minutes):
  - Average hourly cluster cost: ($1.50 + $6.00 + $24.00) / 3 = $10.50
  - Startup cost per run: 2/60 hour × $10.50 = $0.35
  - Total startup cost per month: 100 runs × $0.35 = $35
- Total monthly cost after: $780 + $35 = $815
Summary of Cost Savings
- Total monthly cost before dynamic sizing: $3,467
- Total monthly cost after dynamic sizing: $815
- Monthly cost savings: $3,467 - $815 = $2,652
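The arithmetic above can be double-checked with a small cost model under the same assumptions (100 one-hour runs per month, midpoint node counts per pool, 5-minute vs. 2-minute startups):

```python
RUNS = {'small': 40, 'medium': 40, 'large': 20}        # runs per month
NODES = {'small': 3, 'medium': 6, 'large': 12}         # midpoint of each pool's range
RATE = {'small': 0.50, 'medium': 1.00, 'large': 2.00}  # USD per node-hour
RUN_HOURS = 1

# Static allocation: every run uses the large 16-node r5.4xlarge cluster
static_hourly = 16 * 2.00
static_cost = 100 * RUN_HOURS * static_hourly + 100 * (5 / 60) * static_hourly

# Dynamic allocation: each run uses its right-sized pool
hourly = {size: NODES[size] * RATE[size] for size in RUNS}
dynamic_compute = sum(RUNS[s] * RUN_HOURS * hourly[s] for s in RUNS)
avg_hourly = sum(hourly.values()) / len(hourly)
dynamic_cost = dynamic_compute + 100 * (2 / 60) * avg_hourly

print(f"before: ${static_cost:,.0f}  after: ${dynamic_cost:,.0f}  "
      f"savings: ${static_cost - dynamic_cost:,.0f}")
```

This prints before: $3,467 after: $815 savings: $2,652, matching the summary above (the pre-rounding figures are $3,466.67 and $2,651.67).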
Additional Benefits
Performance Improvements
- Startup time reduction: From 5 minutes to 2 minutes, reducing waiting time by 60%.
- Improved resource utilization: By right-sizing clusters, the resources are better utilized, avoiding over-provisioning.
Environmental Impact
- Reduced energy consumption: Using smaller clusters for smaller workloads decreases the overall energy usage, contributing to a greener environment.
Conclusion
Implementing dynamic cluster sizing in Databricks can lead to significant cost savings and performance improvements. By leveraging historical data and cluster pools, organizations can ensure that each workload is matched with the appropriate resources, leading to optimized costs and enhanced efficiency. This approach not only reduces costs but also promotes sustainable and efficient resource utilization.