This is the second part of our two-part series on cluster configuration best practices for MLOps use cases on Databricks. Part one, Beginners Guide to Cluster Configuration for MLOps, covers essential topics such as selecting the right type of compute cluster, creating and managing clusters, setting policies, determining appropriate cluster sizes, and choosing the optimal runtime environment. In this second part, we skip over the basic concepts and show you how to choose the best cluster configuration for more sophisticated MLOps workloads and usage patterns.
Serverless compute is the best way to run workloads on Databricks. It is fast, simple, and reliable. In scenarios where serverless compute is not available, you can fall back on classic compute and follow the guidance in this article.
In the beginner's guide, we recommend starting with single-user clusters to lower the entry barrier. However, there are scenarios in which you might want to consider shared clusters instead.
One major motivator is cost: with shared clusters, you can save money by pooling resources and minimizing idle cluster time. Another advantage is startup time. If you (as recommended) configure your clusters, single-user or shared, to terminate after a period of inactivity, then every team member has to wait a few minutes each morning for their single-user cluster to start up. A shared cluster, by contrast, only needs to start roughly once a day for a medium-sized team.
However, resource and workload isolation can become challenging, so shared clusters typically offer more limited functionality than single-user clusters. One limitation that stands out in the MLOps context is the lack of support for ML runtimes, which can be a non-starter if you are running ML workloads. You can review the full list of shared cluster limitations here.
Cluster sizing is one of the most challenging parts of cluster configuration. If you need Spark for your workloads, you might ask yourself how many workers you need, how much memory and local disk each worker should have, and how many CPU cores each task requires.
When choosing a cluster, there are multiple aspects to consider. The CPU, disk, and memory of your cluster are parameters you can experiment with while looking for the best configuration for your workload.
Let’s examine what happens when a cluster reads some data and runs some transformation on it.
Here are our assumptions:
Number of workers: 2
Number of CPU cores per worker: 2
Number of cores allocated to each Spark task (spark.task.cpus): 1
Number of data partitions: 4
In this scenario, each worker reads two partitions simultaneously and loads them into its shared memory. This is the first point where you have to think about sizing worker memory so that both partitions fit in it at the same time. Note that the data size in memory is not exactly the same as the size you see in cloud storage, because Spark uses a different in-memory representation and compression than the file format your data is stored in.
Once the partitions are in memory, each CPU core starts processing one partition, and if a shuffle is required, the shuffle partitions are written to the worker's local disk. This is the next point where you should think about sizing your disk so that shuffle partitions fit on it and reads and writes happen as fast as possible.
Another thing to consider is whether each task needs one or more CPU cores. In this example we have allocated one core per task, but for more CPU-intensive tasks you might want to allocate more.
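As a minimal sketch of this example, the snippet below assumes a four-partition Parquet dataset at a hypothetical path and shows where the relevant knobs live. Note that spark.task.cpus is an application-level setting; on Databricks, it belongs in the cluster's Spark config rather than being changed from a notebook at runtime.

```python
from pyspark.sql import SparkSession

# spark.task.cpus must be set when the application starts; on Databricks,
# put it in the cluster's Spark config instead of setting it at runtime.
spark = (
    SparkSession.builder
    .appName("partition-sizing-example")
    .config("spark.task.cpus", "1")   # CPU cores reserved per task
    .getOrCreate()
)

# Hypothetical input path, used for illustration only.
df = spark.read.parquet("s3://my-bucket/events/")

# With 2 workers x 2 cores and spark.task.cpus=1, up to 4 tasks run at once,
# so 4 input partitions keep every core busy.
print("input partitions:", df.rdd.getNumPartitions())
```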
In the following section, we will explore CPU, disk, and memory sizing in more detail and explain the decision-making process.
Spark performs well on machines with many CPU cores because there is limited resource sharing between threads. As a starting point, we suggest giving each executor at least 8 Spark cores; the number of cores per executor can be increased, up to roughly 16, depending on how CPU-intensive the application is.
Once again, these are general guidelines, and the need for additional cores varies by workload. Typically, once an application's data is in memory, it tends to be constrained by CPU or network. The best way to review available resources is the Spark UI, available at http://<spark-master-node>:8080 for vanilla Spark distributions and in Databricks under Clusters > Spark UI > Executors.
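If you prefer to check these numbers from a notebook instead of the UI, a small sketch like the following reads the relevant settings from the Spark context. The fallback strings are there because Databricks often derives executor sizing from the chosen node type rather than setting these keys explicitly.

```python
sc = spark.sparkContext

# Total number of task slots across the cluster (cores available to Spark tasks).
print("default parallelism:", sc.defaultParallelism)

# Executor sizing as seen by the Spark conf; may be unset when Databricks
# derives it from the node type.
conf = sc.getConf()
print("executor cores:", conf.get("spark.executor.cores", "derived from node type"))
print("executor memory:", conf.get("spark.executor.memory", "derived from node type"))
```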
The types and quantity of disks on each machine play a big role in job throughput. Shuffle partitions are first written to local disk and later read by the node or nodes selected to finish the shuffle. With an SSD instead of a spinning disk, these reads and writes happen much faster. So choose nodes with SSDs for local storage and use autoscaling local storage to add more as needed.
Here is a comparison of the SSD sizes of different VMs on AWS. Notice that as the nodes increase in size, the default local SSD sizes and counts also increase, enabling larger shuffles without the need for extra SSDs initially. This leads to significantly better performance.
Allocating nodes with excessive memory can be problematic because garbage collection (GC) becomes more expensive for large heaps. To estimate the memory a particular application needs, load and cache a percentage of the dataset in memory, visit the storage tab (Spark UI > Storage), and examine the dataset's size at rest in memory. Extrapolate from this to estimate the total size if the entire dataset were loaded into memory.
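A minimal sketch of this estimation approach, assuming a DataFrame `df` is already loaded: cache a 10% sample, materialize it, read its in-memory size off the Storage tab, and scale up by roughly 10x.

```python
# Cache roughly 10% of the dataset; the seed is arbitrary and only for repeatability.
sample = df.sample(fraction=0.10, seed=42)
sample.cache()

# An action is required to actually materialize the cache.
sample.count()

# Now open Spark UI > Storage, note the "Size in Memory" of the cached sample,
# and multiply by ~10 to approximate the full dataset's in-memory footprint.

# Release the memory when you are done.
sample.unpersist()
```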
Memory-to-disk spillage in Spark refers to the process of moving data from RAM to disk and back again. This usually occurs when a given partition is too large to fit into RAM, forcing Spark into potentially expensive disk reads and writes to free up local RAM. This can impact performance, as reading and writing to disk is slower compared to in-memory operations. Proper tuning of Spark configurations, such as adjusting memory settings or partition sizes, can help mitigate or prevent memory-to-disk spillage.
To mitigate data spill, your goal should be less data per partition. Increasing the number of partitions reduces the size of each one and thus the likelihood of memory-to-disk spill.
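As a sketch, two common knobs are the shuffle partition count and an explicit repartition. The partition counts and the column name below are illustrative, and recent Databricks runtimes enable adaptive query execution by default, which can tune shuffle partitions for you.

```python
# More, smaller shuffle partitions lower the chance that any single partition
# spills to disk. 400 is illustrative; tune it to your data volume.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Explicitly repartition before a heavy, wide transformation.
# "customer_id" is a hypothetical key column.
repartitioned = df.repartition(400, "customer_id")

# Adaptive query execution (on by default in recent runtimes) can coalesce or
# split shuffle partitions at runtime based on actual sizes.
print(spark.conf.get("spark.sql.adaptive.enabled"))
```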
If your workload uses little data or is non-distributed, for example pure Python, single-threaded pandas, or machine learning libraries such as scikit-learn, then a single-node cluster will do the job. But if you plan to run a Spark job over large volumes of data, choose a multi-node cluster.
There are some general guidelines for deciding how many nodes your cluster needs. We recommend enabling autoscaling between 2 and n nodes, where n is determined by your budget and the number of partitions you want to process simultaneously. By default, Spark processes one partition per available core, so the number of available CPU cores should not exceed the number of partitions; otherwise, some cores will sit idle.
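If you create clusters through the Databricks Clusters API or Terraform, autoscaling is expressed as a min/max worker range. The sketch below is a hypothetical API payload; the name, node type, and runtime version are illustrative and should be adapted to your workspace and cloud.

```python
# Hypothetical payload for the Databricks Clusters API.
cluster_spec = {
    "cluster_name": "mlops-training-shared",
    "spark_version": "14.3.x-cpu-ml-scala2.12",  # an ML runtime
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,   # small floor for responsiveness
        "max_workers": 8,   # "n", bounded by budget and partition count
    },
    "autotermination_minutes": 30,  # avoid paying for idle clusters
}
```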
Another thing to consider is data skew. If your data is partitioned so that some partitions are considerably larger than others, some cores will stay busy longer than the rest: the cores with smaller partitions finish their tasks and sit idle waiting for the bigger partitions to be processed, which wastes resources. In this scenario, you might want to change your partitioning strategy to produce more balanced partitions and, if needed, increase the number of available cores to raise parallelism.
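A quick way to spot skew from a notebook is to count rows per Spark partition; the sketch below assumes a DataFrame `df` and uses the built-in spark_partition_id function.

```python
from pyspark.sql.functions import spark_partition_id

# Row counts per partition; a few partitions that dwarf the rest indicate skew.
(
    df.groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy("count", ascending=False)
      .show(10)
)
```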
As you can see from the AWS example below, there is a continuously expanding array of node types to choose from on your cloud provider of choice.
| Node Type | Memory | CPU | Storage | Use Case |
|-----------|--------|-----|---------|----------|
| r4 | Average | Average | Poor | ETL – NO SLA |
| r3 | Average | Average | Excellent | ETL – NO SLA |
| c4 | Poor | Excellent | Poor | ML |
| c3 | Poor | Excellent | Excellent | ML |
| i3 | Good | Average | Excellent | ETL |
| i2 | Good | Average | Excellent | ETL |
| p2 | Average | Excellent | Poor | DL |
| m4 | Poor | Poor | Poor | Housekeeping |
One specific option worth calling out, which might help with cost savings, is AWS fleet instance types, now available on Databricks. With fleet types you choose a size, for example xlarge, and your node resolves to an xlarge general-purpose instance type with the best spot capacity and price. The remainder of this section aims to help you decide the optimal types and quantity of nodes for your workload.
Drivers typically don't perform resource-intensive work the way workers do, because their primary role is to coordinate the worker nodes. However, there are exceptions, such as collecting large results back to the driver (for example with collect() or toPandas()), broadcasting large variables, or running non-distributed, single-node libraries directly on the driver.
These instances highlight scenarios where having a larger driver is beneficial. In general, though, drivers don't need to be very large. It's essential to consider the specific tasks the driver will perform and size it accordingly.
One consideration when choosing a worker node type is that the maximum concurrency you can achieve equals the total number of CPU cores in your cluster. For example, with 2 worker nodes of 8 CPU cores each, you can run at most 2 × 8 = 16 concurrent tasks.
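From a notebook you can sanity-check this against your data; a sketch assuming a DataFrame `df`:

```python
sc = spark.sparkContext

# Total task slots, roughly workers x cores per worker divided by spark.task.cpus.
slots = sc.defaultParallelism

# Partitions of the dataset you are about to process.
partitions = df.rdd.getNumPartitions()

print(f"{slots} task slots vs {partitions} partitions")
if partitions < slots:
    print("Some cores will sit idle; consider repartitioning or a smaller cluster.")
```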
The second consideration is the size of your data partitions. The data concurrently processed by different cores on a node should fit in the shared memory of that node, so choose your worker machine accordingly.
Databricks pools can be useful when you want to manage and optimize resources in a Databricks workspace. Pools improve resource utilization and performance by preallocating a set of idle, ready-to-use virtual machine instances that multiple clusters can draw from. Scenarios where you might consider pools include reducing cluster start and scale-up times for workloads with tight SLAs, running many short, frequent jobs that would otherwise pay the cluster startup penalty each time, and sharing a warm set of instances across jobs and teams.
It's important to note that while Databricks pools offer benefits in certain scenarios, they might not be necessary for all use cases. The decision to use pools should be based on your specific workload patterns, resource requirements, and cost considerations. It's recommended to assess your requirements and experiment with different configurations to find the most efficient setup for your use case.
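If you manage pools programmatically, a hypothetical Instance Pools API payload looks like the sketch below, with a cluster attached to the pool via its pool ID; the names, node type, and sizes are illustrative.

```python
# Hypothetical payload for the Databricks Instance Pools API.
pool_spec = {
    "instance_pool_name": "mlops-warm-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,                    # instances kept warm and ready
    "max_capacity": 20,                         # upper bound across all clusters
    "idle_instance_autotermination_minutes": 60,
}

# Clusters then reference the pool instead of a node type.
cluster_spec = {
    "cluster_name": "mlops-inference",
    "spark_version": "14.3.x-cpu-ml-scala2.12",
    "instance_pool_id": "<pool-id-returned-on-creation>",
    "autoscale": {"min_workers": 2, "max_workers": 6},
}
```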
This blog is the second part of a series on cluster configuration for MLOps. While the first part covered foundational topics like choosing the right cluster type, creating and managing clusters, and runtime selection, this installment dives deeper into optimizing cluster configurations for advanced MLOps use cases: when shared clusters make sense and what their limitations are; how to size CPU, memory, and local disk, and how to handle spill and skew; how to choose node types and size drivers and workers; and when Databricks pools are worth using.
This guide emphasizes understanding workload patterns and fine-tuning configurations for optimized performance and cost-efficiency in MLOps on Databricks.