06-08-2024 08:20 AM
Databricks is a popular unified data analytics platform known for its powerful data processing capabilities and seamless integration with Apache Spark. However, managing and optimizing costs in Databricks can be challenging, especially when it comes to choosing the right cluster size for different workloads. This article explores how to dynamically select cluster sizes to save costs by leveraging Databricks cluster pools and analyzing logs from previous runs.
Understanding Databricks Cluster Pools
Databricks cluster pools are a way to manage and provision clusters more efficiently. A cluster pool reduces cluster start and auto-scaling times by maintaining a set of ready-to-use instances. When a new cluster is requested, it can be created quickly from the pool, minimizing the time and cost associated with cluster initialization.
Key benefits of using cluster pools include:
- Faster cluster startup and auto-scaling, because idle instances are already provisioned and waiting in the pool.
- Lower cost for idle capacity: instances sitting idle in a pool do not accrue DBU charges (cloud provider instance charges still apply).
- Consistent instance configuration for every cluster that attaches to the pool, which simplifies management.
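As a rough sketch, a pool can also be created programmatically through the Databricks Instance Pools REST API. The workspace URL, token, and pool settings below are placeholders, not a recommended configuration.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"  # placeholder token

# Create a warm pool of r5.2xlarge instances that clusters can draw from
payload = {
    "instance_pool_name": "medium-jobs-pool",
    "node_type_id": "r5.2xlarge",
    "min_idle_instances": 2,
    "max_capacity": 8,
    "idle_instance_autotermination_minutes": 30
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload
)
resp.raise_for_status()
print(resp.json())  # the response contains the new instance_pool_id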
Dynamically Choosing Cluster Sizes
To optimize costs further, you can dynamically select cluster sizes based on historical data from previous runs. This involves analyzing logs to determine the amount of data processed and then using this information to choose the appropriate cluster size from predefined pools.
Steps to Implement Dynamic Cluster Sizing
1. Collect and analyze logs from previous runs.
2. Define cluster pools for different workload sizes.
3. Set thresholds (flags) based on the data analysis.
4. Dynamically allocate a cluster pool for each run.
Example Implementation
Step 1: Collect and Analyze Logs
First, collect logs that contain information about the data processed in previous runs. These can include the number of records processed, the size of the data, and the time taken.
import pandas as pd

# Sample log data from previous runs
logs = pd.DataFrame({
    'run_id': [1, 2, 3, 4, 5],
    'data_size_gb': [10, 25, 50, 5, 35],
    'record_count': [100000, 250000, 500000, 50000, 350000],
    'processing_time_min': [30, 70, 120, 20, 90]
})

# Analyzing the logs
average_data_size = logs['data_size_gb'].mean()
average_record_count = logs['record_count'].mean()
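Averages can understate occasional large runs, so one variation (a sketch on the sample logs above; the 90th-percentile choice is an assumption, not part of the original approach) is to size against a high percentile of historical data volume:
# Size against the 90th percentile of historical data volume rather than the mean,
# so occasional large runs still get enough resources (the 0.9 value is illustrative)
p90_data_size = logs['data_size_gb'].quantile(0.9)
print(f"Average data size: {average_data_size:.1f} GB, p90 data size: {p90_data_size:.1f} GB")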
Step 2: Define Cluster Pools
Create different cluster pools based on anticipated workload sizes.
# Example cluster pool definitions with configurations
cluster_pools = {
    'small': {
        'min_workers': 2,
        'max_workers': 4,
        'node_type_id': 'i3.xlarge',
        'driver_node_type_id': 'i3.xlarge',
        'spark_conf': {
            # illustrative executor settings sized for i3.xlarge (4 vCPU) workers
            'spark.executor.memory': '8g',
            'spark.executor.cores': '2'
        },
        'autotermination_minutes': 20,
        'enable_elastic_disk': True
    },
    'medium': {
        'min_workers': 4,
        'max_workers': 8,
        'node_type_id': 'r5.2xlarge',
        'driver_node_type_id': 'r5.2xlarge',
        'spark_conf': {
            'spark.executor.memory': '16g',
            'spark.executor.cores': '4'
        },
        'autotermination_minutes': 30,
        'enable_elastic_disk': True
    },
    'large': {
        'min_workers': 8,
        'max_workers': 16,
        'node_type_id': 'r5.4xlarge',
        'driver_node_type_id': 'r5.4xlarge',
        'spark_conf': {
            'spark.executor.memory': '32g',
            'spark.executor.cores': '8'
        },
        'autotermination_minutes': 60,
        'enable_elastic_disk': True
    }
}
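If each of these sizes maps to a real instance pool in the workspace, the pool IDs can be resolved by name with the Instance Pools API. The pool names below are hypothetical, and the snippet reuses the placeholder DATABRICKS_HOST and TOKEN from the earlier sketch.
import requests

# Resolve instance_pool_id for each size by pool name (names are hypothetical)
pool_names = {'small': 'small-pool', 'medium': 'medium-pool', 'large': 'large-pool'}

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/list",
    headers={"Authorization": f"Bearer {TOKEN}"}
)
resp.raise_for_status()
pools_by_name = {p['instance_pool_name']: p['instance_pool_id']
                 for p in resp.json().get('instance_pools', [])}

pool_ids = {size: pools_by_name[name] for size, name in pool_names.items()}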
Step 3: Set Flags Based on Data Analysis
Determine thresholds for choosing different cluster sizes based on historical data.
# Setting thresholds
small_threshold = 20 # GB
medium_threshold = 40 # GB
def determine_cluster_size(data_size_gb):
    if data_size_gb <= small_threshold:
        return 'small'
    elif data_size_gb <= medium_threshold:
        return 'medium'
    else:
        return 'large'
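A quick check of the thresholds with the function above:
# Exercise each branch of determine_cluster_size
for size_gb in (10, 30, 55):
    print(size_gb, 'GB ->', determine_cluster_size(size_gb))
# 10 GB -> small, 30 GB -> medium, 55 GB -> large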
Step 4: Dynamic Cluster Allocation
Use the flag to choose the cluster pool dynamically for each run; the selected configuration can then be handed to the Databricks APIs, as sketched after the snippet below.
# Example data size for the current run
current_data_size = 30 # GB
# Determine cluster size based on the current data size
selected_cluster_size = determine_cluster_size(current_data_size)
selected_pool = cluster_pools[selected_cluster_size]
# Print the selected pool configuration
print(f"Selected Cluster Pool: {selected_cluster_size}")
print(f"Cluster Configuration: {selected_pool}")
To quantify the benefits of dynamically choosing cluster sizes in Databricks, it's essential to conduct a benchmarking exercise. This involves comparing the costs and performance metrics before and after implementing dynamic cluster sizing. Let's assume we have collected data from multiple runs of a typical workload over a month.
In the static cluster allocation scenario, we use a predefined large cluster for all workloads, regardless of their size. The cluster configuration is as follows:
- 16 nodes at 2.00 USD per node per hour (32.00 USD per hour in total)
In the dynamic cluster allocation scenario, clusters are chosen based on the size of the workload. Let's assume the cluster configurations and their associated costs are as follows:
- Small: 3 nodes at 0.50 USD per node per hour
- Medium: 6 nodes at 1.00 USD per node per hour
- Large: 12 nodes at 2.00 USD per node per hour
Based on our log analysis, we categorize the 100 monthly runs (each about one hour) as follows:
- Small workloads: 40 runs
- Medium workloads: 40 runs
- Large workloads: 20 runs
Before Implementation
- Hourly cost per node = 2.00 USD
- Total hourly cost = 16 nodes × 2.00 USD = 32.00 USD
- Monthly cost = 100 runs × 1 hour per run × 32.00 USD = 3,200 USD
- Startup time cost per run = 5 minutes = 5/60 hour × 32.00 USD ≈ 2.67 USD
- Total startup cost per month = 100 runs × 2.67 USD = 267 USD
- Total monthly cost = 3,200 USD + 267 USD = 3,467 USD
After Implementation
Small workloads (40 runs):
- Hourly cost per node = 0.50 USD
- Total hourly cost = 3 nodes × 0.50 USD = 1.50 USD
- Monthly cost for small workloads = 40 runs × 1 hour per run × 1.50 USD = 60 USD
Medium workloads (40 runs):
- Hourly cost per node = 1.00 USD
- Total hourly cost = 6 nodes × 1.00 USD = 6.00 USD
- Monthly cost for medium workloads = 40 runs × 1 hour per run × 6.00 USD = 240 USD
Large workloads (20 runs):
- Hourly cost per node = 2.00 USD
- Total hourly cost = 12 nodes × 2.00 USD = 24.00 USD
- Monthly cost for large workloads = 20 runs × 1 hour per run × 24.00 USD = 480 USD
Total monthly cluster cost = 60 USD + 240 USD + 480 USD = 780 USD
Assuming the dynamic allocation reduces the average startup time to 2 minutes:
- Startup time cost per run = 2 minutes = 2/60 hour × average hourly cluster cost
- Average hourly cluster cost = (1.50 USD + 6.00 USD + 24.00 USD) / 3 = 10.50 USD
- Startup time cost per run = 2/60 × 10.50 USD = 0.35 USD
- Total startup cost per month = 100 runs × 0.35 USD = 35 USD
- Total monthly cost = 780 USD + 35 USD = 815 USD
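Putting the two scenarios side by side, this example works out to a saving of roughly 2,652 USD per month, or about 76%. A quick calculation using the figures above:
# Compare the static and dynamic scenarios using the figures above
static_monthly = 3200 + 267   # cluster cost + startup cost (USD)
dynamic_monthly = 780 + 35    # cluster cost + startup cost (USD)
savings = static_monthly - dynamic_monthly
print(f"Monthly savings: {savings} USD ({savings / static_monthly:.0%})")
# Monthly savings: 2652 USD (76%)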
Implementing dynamic cluster sizing in Databricks can lead to significant cost savings and performance improvements. By leveraging historical data and cluster pools, organizations can match each workload with appropriate resources, optimizing costs while maintaining efficiency. This approach not only reduces spend but also promotes sustainable and efficient resource utilization.
06-25-2024 03:09 AM
@Harun This is amazing! Thank you for sharing
03-12-2025 08:21 AM
How can this actually be used to choose a cluster pool for a Databricks workflow dynamically, that is, at run time? In other words, what can you actually do with the value of `selected_pool` other than printing it out?