This is the second part of our two-part series on cluster configuration best practices for MLOps use cases on Databricks. Part one, Beginners Guide to Cluster Configuration for MLOps, covers essential topics such as selecting the right type of compute cluster, creating and managing clusters, setting policies, determining appropriate cluster sizes, and choosing the optimal runtime environment. In this second part, we skip over the basic concepts and show you how to choose the best cluster configuration for more sophisticated MLOps workloads and usage patterns.
Serverless compute is the best way to run workloads on Databricks. It is fast, simple, and reliable. In scenarios where serverless compute is not available, you can fall back on classic compute and follow the guidance in this article.
In the beginner's guide, we recommend starting with single-user clusters to lower the entry barrier. However, there are scenarios in which you might want to consider shared clusters instead.
One major motivator is cost: with shared clusters, you can save money by pooling resources and minimizing idle cluster time. Another advantage is startup time. If you (as recommended) configure your clusters, single-user or shared, to terminate after a period of inactivity, then every team member has to wait a few minutes each morning for their single-user cluster to start up. A shared cluster, by contrast, only needs to start roughly once a day for a medium-sized team.
However, resource and workload isolation can become challenging, so shared clusters typically offer more limited functionality than single-user clusters. One limitation that stands out in the MLOps context is the lack of support for ML runtimes, which can be a non-starter if you are running ML workloads. You can review the full list of shared cluster limitations here.
Cluster sizing is one of the most challenging parts of cluster configuration. If you need Spark for your workloads, you might ask yourself how many workers you need, how much memory and local disk each worker should have, and how many CPU cores each task requires.
When choosing a cluster, there are multiple aspects to consider. The CPU, disk, and memory of your cluster are parameters you can experiment with while looking for the best configuration for your workload.
Let’s examine what happens when a cluster reads some data and runs some transformation on it.
Here are our assumptions:
Number of workers: 2
Number of CPU cores per worker: 2
Number of cores allocated to each Spark task (spark.task.cpus): 1
Number of data partitions: 4
In this scenario, each worker reads two partitions simultaneously and loads them into its shared memory. This is the first point where you have to think about sizing worker memory so that both partitions fit in it at the same time. Note that the data size in memory is not exactly the same as the size you see in cloud storage, because Spark uses a different in-memory representation and compression than the file format your data is stored in.
Once the partitions are in memory, each CPU core starts processing one partition, and if a shuffle is required, the shuffle partitions are written to the worker's local disk. This is the next point where you should think about sizing your disk so that shuffle partitions fit on it and reads and writes happen as fast as possible.
Another thing to consider is whether each task needs one or more CPU cores. In this example we have allocated one core per task, but for more CPU-intensive tasks you might want to allocate more.
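As a minimal sketch of this example, the snippet below assumes a four-partition Parquet dataset at a hypothetical path and shows where the relevant knobs live. Note that spark.task.cpus is an application-level setting; on Databricks, it belongs in the cluster's Spark config rather than being changed from a notebook at runtime.

```python
from pyspark.sql import SparkSession

# spark.task.cpus must be set when the application starts; on Databricks,
# put it in the cluster's Spark config instead of setting it at runtime.
spark = (
    SparkSession.builder
    .appName("partition-sizing-example")
    .config("spark.task.cpus", "1")   # CPU cores reserved per task
    .getOrCreate()
)

# Hypothetical input path, used for illustration only.
df = spark.read.parquet("s3://my-bucket/events/")

# With 2 workers x 2 cores and spark.task.cpus=1, up to 4 tasks run at once,
# so 4 input partitions keep every core busy.
print("input partitions:", df.rdd.getNumPartitions())
```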
In the following section, we will explore CPU, disk, and memory sizing in more detail and explain the decision-making process.
Spark performs well on machines with many CPU cores because there is limited resource sharing between threads. As a starting point, we suggest giving each executor at least 8 Spark cores; the number of cores per executor can be increased, up to roughly 16, depending on how CPU-intensive the application is.
Once again, these are general guidelines, and the need for additional cores varies by workload. Typically, once an application's data is in memory, it tends to be constrained by CPU or network. The best way to review available resources is the Spark UI, available at http://<spark-master-node>:8080 for vanilla Spark distributions and in Databricks under Clusters > Spark UI > Executors.
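If you prefer to check these numbers from a notebook instead of the UI, a small sketch like the following reads the relevant settings from the Spark context. The fallback strings are there because Databricks often derives executor sizing from the chosen node type rather than setting these keys explicitly.

```python
sc = spark.sparkContext

# Total number of task slots across the cluster (cores available to Spark tasks).
print("default parallelism:", sc.defaultParallelism)

# Executor sizing as seen by the Spark conf; may be unset when Databricks
# derives it from the node type.
conf = sc.getConf()
print("executor cores:", conf.get("spark.executor.cores", "derived from node type"))
print("executor memory:", conf.get("spark.executor.memory", "derived from node type"))
```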
The types and quantity of disks on each machine play a big role in job throughput. Shuffle partitions are first written to local disk and later read by the node or nodes selected to finish the shuffle. With an SSD instead of a spinning disk, these reads and writes happen much faster. So choose nodes with SSDs for local storage and use autoscaling local storage to add more as needed.
Here is a comparison of the SSD sizes of different VMs on AWS. Notice that as the nodes increase in size, the default local SSD sizes and counts also increase, enabling larger shuffles without the need for extra SSDs initially. This leads to significantly better performance.
Allocating nodes with excessive memory can be problematic because garbage collection (GC) becomes more expensive for large heaps. To estimate the memory a particular application needs, load and cache a percentage of the dataset in memory, visit the storage tab (Spark UI > Storage), and examine the dataset's size at rest in memory. Extrapolate from this to estimate the total size if the entire dataset were loaded into memory.
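A minimal sketch of this estimation approach, assuming a DataFrame `df` is already loaded: cache a 10% sample, materialize it, read its in-memory size off the Storage tab, and scale up by roughly 10x.

```python
# Cache roughly 10% of the dataset; the seed is arbitrary and only for repeatability.
sample = df.sample(fraction=0.10, seed=42)
sample.cache()

# An action is required to actually materialize the cache.
sample.count()

# Now open Spark UI > Storage, note the "Size in Memory" of the cached sample,
# and multiply by ~10 to approximate the full dataset's in-memory footprint.

# Release the memory when you are done.
sample.unpersist()
```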
Memory-to-disk spillage in Spark refers to the process of moving data from RAM to disk and back again. This usually occurs when a given partition is too large to fit into RAM, forcing Spark into potentially expensive disk reads and writes to free up local RAM. This can impact performance, as reading and writing to disk is slower compared to in-memory operations. Proper tuning of Spark configurations, such as adjusting memory settings or partition sizes, can help mitigate or prevent memory-to-disk spillage.
To mitigate data spill, your goal should be less data per partition. Increasing the number of partitions reduces the size of each one and thus the likelihood of memory-to-disk spill.
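As a sketch, two common knobs are the shuffle partition count and an explicit repartition. The partition counts and the column name below are illustrative, and recent Databricks runtimes enable adaptive query execution by default, which can tune shuffle partitions for you.

```python
# More, smaller shuffle partitions lower the chance that any single partition
# spills to disk. 400 is illustrative; tune it to your data volume.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Explicitly repartition before a heavy, wide transformation.
# "customer_id" is a hypothetical key column.
repartitioned = df.repartition(400, "customer_id")

# Adaptive query execution (on by default in recent runtimes) can coalesce or
# split shuffle partitions at runtime based on actual sizes.
print(spark.conf.get("spark.sql.adaptive.enabled"))
```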
If your workload uses little data or is non-distributed, for example pure Python, single-threaded pandas, or machine learning libraries such as scikit-learn, then a single-node cluster will do the job. But if you plan to run a Spark job over large volumes of data, choose a multi-node cluster.
There are some general guidelines for deciding how many nodes your cluster needs. We recommend enabling autoscaling between 2 and n nodes, where n is determined by your budget and the number of partitions you want to process simultaneously. By default, Spark processes one partition per available core, so the number of available CPU cores should not exceed the number of partitions; otherwise, some cores will sit idle.
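If you create clusters through the Databricks Clusters API or Terraform, autoscaling is expressed as a min/max worker range. The sketch below is a hypothetical API payload; the name, node type, and runtime version are illustrative and should be adapted to your workspace and cloud.

```python
# Hypothetical payload for the Databricks Clusters API.
cluster_spec = {
    "cluster_name": "mlops-training-shared",
    "spark_version": "14.3.x-cpu-ml-scala2.12",  # an ML runtime
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,   # small floor for responsiveness
        "max_workers": 8,   # "n", bounded by budget and partition count
    },
    "autotermination_minutes": 30,  # avoid paying for idle clusters
}
```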
Another thing to consider is data skew. If your data is partitioned so that some partitions are considerably larger than others, some cores will stay busy longer than the rest: the cores with smaller partitions finish their tasks and sit idle waiting for the bigger partitions to be processed, which wastes resources. In this scenario, you might want to change your partitioning strategy to produce more balanced partitions and, if needed, increase the number of available cores to raise parallelism.
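A quick way to spot skew from a notebook is to count rows per Spark partition; the sketch below assumes a DataFrame `df` and uses the built-in spark_partition_id function.

```python
from pyspark.sql.functions import spark_partition_id

# Row counts per partition; a few partitions that dwarf the rest indicate skew.
(
    df.groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy("count", ascending=False)
      .show(10)
)
```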
As you can see from the AWS example below, there is a continuously expanding array of node types to choose from on your cloud provider of choice.
| Node Type | Memory | CPU | Storage | Use Case |
|-----------|--------|-----|---------|----------|
| r4 | Average | Average | Poor | ETL – NO SLA |
| r3 | Average | Average | Excellent | ETL – NO SLA |
| c4 | Poor | Excellent | Poor | ML |
| c3 | Poor | Excellent | Excellent | ML |
| i3 | Good | Average | Excellent | ETL |
| i2 | Good | Average | Excellent | ETL |
| p2 | Average | Excellent | Poor | DL |
| m4 | Poor | Poor | Poor | Housekeeping |
One specific option worth calling out, which might help with cost savings, is AWS fleet instance types, now available on Databricks. With fleet types you choose a size, for example xlarge, and your node resolves to an xlarge general-purpose instance type with the best spot capacity and price. The remainder of this section aims to help you decide the optimal types and quantity of nodes for your workload.
Drivers typically don't perform resource-intensive work the way workers do, because their primary role is to coordinate the worker nodes. However, there are exceptions, such as collecting large results back to the driver (for example with collect() or toPandas()), broadcasting large variables, or running non-distributed, single-node libraries directly on the driver.
These instances highlight scenarios where having a larger driver is beneficial. In general, though, drivers don't need to be very large. It's essential to consider the specific tasks the driver will perform and size it accordingly.
One consideration when choosing a worker node type is that the maximum concurrency you can achieve equals the total number of CPU cores in your cluster. For example, with 2 worker nodes of 8 CPU cores each, you can run at most 2 × 8 = 16 concurrent tasks.
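From a notebook you can sanity-check this against your data; a sketch assuming a DataFrame `df`:

```python
sc = spark.sparkContext

# Total task slots, roughly workers x cores per worker divided by spark.task.cpus.
slots = sc.defaultParallelism

# Partitions of the dataset you are about to process.
partitions = df.rdd.getNumPartitions()

print(f"{slots} task slots vs {partitions} partitions")
if partitions < slots:
    print("Some cores will sit idle; consider repartitioning or a smaller cluster.")
```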
The second consideration is the size of your data partitions. The data concurrently processed by different cores on a node should fit in the shared memory of that node, so choose your worker machine accordingly.
Databricks pools can be useful when you want to manage and optimize resources in a Databricks workspace. Pools improve resource utilization and performance by preallocating a set of idle, ready-to-use virtual machine instances that multiple clusters can draw from. Scenarios where you might consider pools include reducing cluster start and scale-up times for workloads with tight SLAs, running many short, frequent jobs that would otherwise pay the cluster startup penalty each time, and sharing a warm set of instances across jobs and teams.
It's important to note that while Databricks pools offer benefits in certain scenarios, they might not be necessary for all use cases. The decision to use pools should be based on your specific workload patterns, resource requirements, and cost considerations. It's recommended to assess your requirements and experiment with different configurations to find the most efficient setup for your use case.
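If you manage pools programmatically, a hypothetical Instance Pools API payload looks like the sketch below, with a cluster attached to the pool via its pool ID; the names, node type, and sizes are illustrative.

```python
# Hypothetical payload for the Databricks Instance Pools API.
pool_spec = {
    "instance_pool_name": "mlops-warm-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,                    # instances kept warm and ready
    "max_capacity": 20,                         # upper bound across all clusters
    "idle_instance_autotermination_minutes": 60,
}

# Clusters then reference the pool instead of a node type.
cluster_spec = {
    "cluster_name": "mlops-inference",
    "spark_version": "14.3.x-cpu-ml-scala2.12",
    "instance_pool_id": "<pool-id-returned-on-creation>",
    "autoscale": {"min_workers": 2, "max_workers": 6},
}
```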
This blog is the second part of a series on cluster configuration for MLOps. While the first part covered foundational topics like choosing the right cluster type, creating and managing clusters, and runtime selection, this installment dives deeper into optimizing cluster configurations for advanced MLOps use cases: when shared clusters make sense and what their limitations are; how to size CPU, memory, and local disk, and how to handle spill and skew; how to choose node types and size drivers and workers; and when Databricks pools are worth using.
This guide emphasizes understanding workload patterns and fine-tuning configurations for optimized performance and cost-efficiency in MLOps on Databricks.