Databricks Community

msahil · 04-19-2024

Introduction

Cost optimisation remains a pivotal challenge for customers who process large volumes of data and train machine learning models at scale in the cloud. Spot instances have revolutionised how organisations approach cloud computing, offering a cost-effective alternative to on-demand and reserved instances. With spot instances, users bid for unused computing capacity at significantly lower prices, which makes them an attractive option for data and AI workloads that are fault-tolerant and flexible in execution time. Databricks lets customers use on-demand, reserved, and spot instances for the clusters that process these workloads. However, spot instances come with challenges, including the risk of interruptions and the need for a robust strategy to manage them effectively.

This blog provides a comprehensive guide on the best practices for using spot instances to optimise data and AI workload costs. We will explore the critical considerations for identifying and selecting workloads suitable for spot instances, strategies for managing spot instance interruptions, and recommendations for blending spot and on-demand instances for Databricks clusters to achieve cost efficiency without sacrificing reliability.

On-Demand, Reserved, and Spot Instances

A key consideration for cost optimisation of data and AI workloads is knowing when to use on-demand, reserved and spot instances, as they differ in pricing, availability and flexibility. Here are some thoughts:

On-Demand Instances: These are ideal for quickly testing new workloads or unpredictable workloads that cannot be interrupted. They offer flexibility without any long-term commitments.
Reserved Instances: Use reserved instances once a specific workload or environment with a predictable usage pattern is identified, such as an overnight batch use case or the production environment. They offer up to 75% cost savings over on-demand instances with a one- or three-year commitment. Unlike spot instances, reserved instances are never evicted, so no special considerations are required to use them.
Spot Instances: Spot instances are the cheapest option, offering up to 90% cost savings over on-demand instances. However, they can be evicted at short notice, from 30 seconds to 2 minutes depending on the cloud vendor. They therefore need a deliberate approach, which the rest of this blog outlines.
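
To make the pricing difference concrete, here is a rough sketch of the blended hourly cost of a cluster that mixes pricing models. The per-node price and the node counts are hypothetical, and the 75% and 90% figures are the upper-bound discounts quoted above, so real savings will usually be smaller.

```python
def blended_hourly_cost(on_demand_price, nodes, discounts):
    """Hourly cluster cost given node counts per pricing model.

    `nodes` maps a pricing model to a node count; `discounts` maps a pricing
    model to the fraction saved relative to the on-demand price.
    """
    return sum(
        count * on_demand_price * (1.0 - discounts[model])
        for model, count in nodes.items()
    )


# Hypothetical on-demand price per node-hour (not a real quote).
ON_DEMAND_PRICE = 1.00
# Best-case discounts from the text above ("up to 75%", "up to 90%").
DISCOUNTS = {"on_demand": 0.0, "reserved": 0.75, "spot": 0.90}

# One driver plus eight workers, all on-demand.
all_on_demand = blended_hourly_cost(ON_DEMAND_PRICE, {"on_demand": 9}, DISCOUNTS)
# Same cluster with four of the eight workers moved to spot.
mixed = blended_hourly_cost(ON_DEMAND_PRICE, {"on_demand": 5, "spot": 4}, DISCOUNTS)

savings_pct = 100 * (1 - mixed / all_on_demand)
print(f"all on-demand: {all_on_demand:.2f}/h, mixed: {mixed:.2f}/h, "
      f"best-case saving: {savings_pct:.1f}%")
```

Plugging in your own prices and the discounts you actually observe gives a quick estimate of whether a workload is worth moving.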

Guiding Principles For Use Case and Workload Selection

Before diving further, it's crucial to assess organisational readiness and have a set of guiding principles that can be used to select workloads appropriate for spot instances. The key considerations are:

Non-time Sensitive Workloads: Workloads that can tolerate disruptions and are not bound by strict time constraints.
Fault-Tolerant Workloads: Workloads that can be checkpointed and restarted without losing significant progress.
Seasonal and Scalable Workloads: Workloads that see seasonal spikes in data volume during particular periods.
Development and Staging Environments: Given their lower criticality, workloads in these environments can run a higher ratio of spot to on-demand instances than production environments.

Applying these principles, some of the workloads or use cases that could potentially be considered for spot instances are:

Machine learning model training at scale, like training deep learning models
Pre-training and fine-tuning LLMs with domain-specific datasets
Batch jobs running overnight for machine learning prediction on large datasets
Batch jobs running ETL pipelines overnight for large-sized dataset ingestion and processing
BI dashboards that crunch large datasets, are refreshed overnight, and are shared with CXOs daily to monitor organisation-wide operational metrics

Azure and AWS provide spot instances with CPUs and GPUs across multiple sizes that can be used based on individual workload needs.

Recommended Approach: Start > Monitor > Improvise > Expand

Once the target workloads have been identified, it's important to have a process to test out the right mix of spot instances before rolling them out across all teams and environments.

The recommended process is to:

Start Small: To evaluate the effectiveness of spot instances, begin with a single cluster and limit the evaluation to a small group of developers. Choose a cluster with high monthly usage, where the savings can be significant relative to the total monthly cost of running the environment.
Monitor Performance and Costs: Monitor eviction histories and bid prices closely to ensure they align with average availability price points. If the bid price is set too far below the going spot price, the cluster may not obtain spot instances at all. Remember that spot pricing varies by region and changes frequently, so keeping track of it is crucial. AWS provides this information in Spot Instance Advisor and Azure within the Azure Portal, and it can even be ingested into your lakehouse and monitored for more effective spot instance usage.
Improvise Based on Learnings: Adjust your strategy based on initial experiences, selecting spot instances with lower eviction rates and tweaking bid prices.
Expand Gradually: Once you have identified the suitable configurations, roll out the best practices learned from the initial implementation to other teams and workloads across the organisation.
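
As an illustration of the monitor and improvise steps, the sketch below ranks candidate instance types by observed eviction rate and then by average spot price. The record layout, instance type names and numbers are all made up for the example; in practice they would come from whatever eviction and pricing data you collect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotObservation:
    instance_type: str     # hypothetical instance type label
    eviction_rate: float   # fraction of spot nodes evicted in the observed window
    avg_spot_price: float  # average spot price per node-hour in the same window


def rank_spot_candidates(observations, max_eviction_rate):
    """Keep types at or below the eviction threshold, most stable and cheapest first."""
    eligible = [o for o in observations if o.eviction_rate <= max_eviction_rate]
    return sorted(eligible, key=lambda o: (o.eviction_rate, o.avg_spot_price))


# Made-up monitoring data, for illustration only.
observations = [
    SpotObservation("type-a", eviction_rate=0.04, avg_spot_price=0.31),
    SpotObservation("type-b", eviction_rate=0.18, avg_spot_price=0.22),
    SpotObservation("type-c", eviction_rate=0.04, avg_spot_price=0.27),
]

ranked = rank_spot_candidates(observations, max_eviction_rate=0.10)
for obs in ranked:
    print(obs.instance_type, obs.eviction_rate, obs.avg_spot_price)
```

Here the cheapest type is excluded because it is evicted too often, which is the kind of trade-off the monitoring step is meant to surface.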

Spot Allocation In Databricks Clusters

Clusters in Databricks can run on 100% on-demand or reserved instances on driver and worker nodes. But to achieve the best price/performance ratio for a workload, here are key considerations for using the right instance type for driver and worker nodes:

Driver Node: A hybrid cluster approach is recommended. The driver node should always use an on-demand or reserved instance so that it maintains the cluster state even when all worker nodes use spot instances and receive an eviction notice. The cluster will eventually recover with new spot instances from the available pool or fall back to on-demand or reserved instances. If the driver node uses a spot instance (not recommended) and receives an eviction notice from the cloud provider, the cluster will terminate and stop processing the workload.
Worker Nodes: Use a mix of on-demand or reserved and spot instances, starting with a smaller allocation of spot instances, say 25%, and then gradually increasing it while adhering to the SLA and business requirements of each environment (development, staging or production).
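
The worker split can be sketched as a small helper. This is only an illustration: the function name, the eight-worker cluster and the rollout percentages after the initial 25% are assumptions, and the split rounds down so that any leftover node stays on on-demand or reserved capacity.

```python
def split_workers(num_workers, spot_percent):
    """Return (on_demand_or_reserved, spot) worker counts for a spot percentage.

    Integer arithmetic rounds the spot share down, so reliability wins ties.
    """
    if not 0 <= spot_percent <= 100:
        raise ValueError("spot_percent must be between 0 and 100")
    spot = num_workers * spot_percent // 100
    return num_workers - spot, spot


# A gradual rollout for a hypothetical eight-worker cluster.
rollout = {pct: split_workers(8, pct) for pct in (25, 50, 75)}
for pct, (stable, spot) in rollout.items():
    print(f"{pct}% spot -> {stable} on-demand/reserved workers, {spot} spot workers")
```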

Cluster Behaviour On Spot Eviction or Non-Availability

To better understand how Databricks manages the fallback operation internally, consider a cluster whose driver node uses a reserved instance and which has eight worker nodes: four configured to use reserved instances and four configured to use spot instances.

When a cluster with spot mix is started

Depending on the number of available spot instances in the cloud provider pool, the cluster will start with the following instance types:

When four spot instances are available in the pool, the cluster will have four spot and four reserved instances for worker nodes.
When only two spot instances are available in the pool, the cluster will have two spot and six reserved instances for worker nodes.
When no spot instances are available in the pool, the cluster will start with eight reserved instances for worker nodes.

When spot instances are evicted in a running cluster

When one or all spot instances are evicted, Databricks attempts to acquire the same number of replacement spot instances from the pool. If they are not available, it falls back to new reserved instances to match the desired cluster capacity, depending on how the cluster was configured at setup time.

The key takeaway is that you always retain the configured cluster capacity, in terms of the number of nodes, when using spot instances with Databricks clusters. On eviction, the worker nodes either fall back to new spot instances from the available pool or fall back to on-demand or reserved instances, depending on the cluster configuration.
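
The behaviour described in this section can be mimicked with a toy model. It is not how Databricks implements fallback internally; it only reproduces the node counts from the eight-worker scenario, assuming the cluster is configured to fall back. The function names and the eviction example (three evictions, one spot instance left in the pool) are illustrative.

```python
def start_cluster(spot_requested, reserved_requested, spot_available):
    """Worker mix at start-up when only `spot_available` spot nodes can be obtained."""
    spot = min(spot_requested, spot_available)
    shortfall = spot_requested - spot  # spot nodes that could not be acquired
    return {"spot": spot, "reserved": reserved_requested + shortfall}


def replace_evicted(workers, evicted, spot_available):
    """Worker mix after `evicted` spot nodes are lost and replaced."""
    evicted = min(evicted, workers["spot"])
    new_spot = min(evicted, spot_available)  # replacements found in the pool
    return {
        "spot": workers["spot"] - evicted + new_spot,
        "reserved": workers["reserved"] + (evicted - new_spot),
    }


# Start-up with four, two and zero spot instances available in the pool.
for available in (4, 2, 0):
    print(available, start_cluster(4, 4, available))

# Running cluster loses three spot nodes while only one is available.
running = start_cluster(4, 4, 4)
after = replace_evicted(running, evicted=3, spot_available=1)
print(after)
```

In every case the total number of workers stays at eight; only the mix changes.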

Setting Up Clusters With Spot Instances

To set up clusters with spot instances, create a new cluster using the Clusters API (Azure, AWS). Properties to set in the API are:

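“first_on_demand”: The first “first_on_demand” nodes of the cluster are placed on on-demand or reserved instances. The value should always be at least 1 (that is, greater than 0) so that the driver node is always an on-demand or reserved instance.
“availability”: Used together with “first_on_demand” to set the cluster's desired number of spot instances. It determines the availability type (spot, on-demand, or spot with fallback) of the nodes beyond the first “first_on_demand” ones.
“spot_bid_max_price”: While this property is optional, it can define the maximum price for spot bids if the total cost of running the workload must be under a strict threshold. This is the Azure property; on AWS the corresponding property is “spot_bid_price_percent”, expressed as a percentage of the on-demand price.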
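
As a rough sketch, the request body for the eight-worker scenario could be assembled as shown here. The helper name, the cluster name and the `<...>` placeholders are hypothetical, and the bid values are simply the documented defaults as I understand them (100 for `spot_bid_price_percent` on AWS, -1 for `spot_bid_max_price` on Azure, meaning no eviction on price). Check the Clusters API reference for your cloud before relying on any of these fields.

```python
import json


def build_cluster_spec(cloud, num_workers, first_on_demand):
    """Build a Clusters API request body as a plain dict (no request is sent)."""
    spec = {
        "cluster_name": "spot-evaluation",   # hypothetical name
        "spark_version": "<spark-version>",  # placeholder, fill in for your workspace
        "node_type_id": "<node-type-id>",    # placeholder, fill in for your cloud
        "num_workers": num_workers,
    }
    if cloud == "aws":
        spec["aws_attributes"] = {
            "first_on_demand": first_on_demand,
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": 100,  # assumed default: bid up to the on-demand price
        }
    elif cloud == "azure":
        spec["azure_attributes"] = {
            "first_on_demand": first_on_demand,
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "spot_bid_max_price": -1,  # assumed default: never evict on price
        }
    else:
        raise ValueError(f"unsupported cloud: {cloud}")
    return spec


# Driver plus four workers on on-demand/reserved, remaining four workers on spot.
aws_spec = build_cluster_spec("aws", num_workers=8, first_on_demand=5)
azure_spec = build_cluster_spec("azure", num_workers=8, first_on_demand=5)
print(json.dumps(aws_spec, indent=2))
```

Because `first_on_demand` counts the driver first, a value of 5 here covers the driver and four of the eight workers.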

Enable Decommissioning

Spot instances can be evicted by cloud providers at any time and at short notice, which can cause issues with running jobs, including:

Shuffle fetch failures
Shuffle data loss
RDD data loss
Job failures

Use decommissioning (Azure, AWS) to help address these issues. Decommissioning takes advantage of the notification that the cloud provider sends before a spot instance is reclaimed, and uses that window to attempt to migrate shuffle and RDD data to healthy worker nodes.
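
A minimal sketch of what enabling decommissioning might look like follows. The keys are the Apache Spark decommissioning settings as I understand them; which of them your Databricks Runtime version honours, and whether further settings are needed, is an assumption to verify against the decommissioning documentation linked above. The helper that renders them as text is illustrative.

```python
# Spark settings commonly associated with graceful decommissioning (verify for your runtime).
DECOMMISSION_SPARK_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",  # migrate shuffle data
    "spark.storage.decommission.rddBlocks.enabled": "true",      # migrate cached RDD blocks
}


def as_spark_config_lines(conf):
    """Render a conf dict as 'key value' lines, one setting per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# The same dict could be passed as the "spark_conf" field of a cluster spec.
cluster_spec_fragment = {"spark_conf": dict(DECOMMISSION_SPARK_CONF)}
print(as_spark_config_lines(DECOMMISSION_SPARK_CONF))
```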

Spot Instances In Production Environments

From the above, it is evident that once the correct process is identified and validated, spot instances can also be used in production environments for specific workloads, providing the best price/performance ratio and reducing the TCO, especially for organisations handling large-scale data and AI workloads.

Summary And Key Takeaways

Assess Workload Suitability: Not all workloads are suitable for spot instances. Evaluate fault tolerance, statelessness, and flexibility before opting for spot instances.
Start Small and Monitor: Start with a small implementation, closely monitor performance and costs, and adjust your strategy based on the lessons learned.
Use a Hybrid Approach: Combining on-demand/reserved instances with spot instances in clusters ensures reliability while optimising costs.
Enable Decommissioning: To mitigate issues caused by spot instance eviction, enable decommissioning to receive notifications and attempt data migration.
Monitor and Adjust Bidding Strategies: Monitor spot instance performance and adjust bid prices to maximise usage and cost savings.

Serverless Compute vs. Classic Compute

Another option for cost optimisation of Data and AI workloads is using serverless computing in Databricks, which offers several advantages over classic compute with on-demand, reserved and spot instances:

Managed Infrastructure: With serverless computing, Databricks manages the computing resources within your Databricks account, reducing the need for you to manage, configure, or scale cloud infrastructure, which allows your data team to focus more on their core tasks.
Cost Efficiency: Serverless compute can lower overall costs by an average of 40% through intelligent workload management, faster reads from cloud storage, and automatic determination of instance types and configurations for the best price/performance.
High Concurrency and Automatic Load Balancing: Serverless compute is designed to automatically scale to provide unlimited concurrency without disruption, making it suitable for high concurrency use cases.
Instant Startup and Greater Availability: Serverless compute offers instant startup and greater availability, improving the user experience and productivity.
Security: Serverless compute runs within a network boundary for the workspace, with various layers of security to isolate different Databricks customer workspaces and additional network controls between clusters of the same customer.
Reduced Idle Costs: Serverless compute aggressively optimises away idle costs, providing instant, secure, and zero-management compute.
Simplified Compute Management: Serverless Jobs significantly simplifies compute configuration by eliminating customer-managed cloud infrastructure.

Working with serverless compute is similar to working with Databricks Runtime clusters in the Data Science and Engineering or Databricks Machine Learning environments.

Call To Action

Embrace the transformative potential of spot instances for your data and AI workloads. Begin by assessing your organisational readiness and selecting suitable fault-tolerant and flexible workloads. Start small to understand the nuances of managing spot instances, monitor their performance closely, and gradually expand their use across your organisation. By adopting a strategic approach to integrating spot instances with on-demand and reserved instances, you can significantly reduce your total cost of ownership while maintaining reliability. Take advantage of the opportunity to optimise your cloud computing costs, and start leveraging the power of spot instances now.