When setting up compute, there are many options and knobs to tweak and tune, and it can quickly become overwhelming. To help you configure your clusters optimally, we have broken this topic down into two articles: a beginner's guide and an advanced guide.
In this article, we cover the first stage, the beginner's guide.
Compute, or a compute cluster, refers to a computation resource built on one or more cloud virtual machines (VMs, also referred to as nodes) that you can use to accomplish your data engineering, data science, or AI/ML tasks.
Before we begin the cluster creation and configuration process, we need to understand the use case the compute will serve, as this determines the type of cluster we need.
There are several types of compute available on Databricks:
In addition, Databricks offers ML-specialized compute types that are tailored for specific use cases (out of the scope of this Beginners’ Guide):
Here is our recommendation on when to use each type:
Once you have determined the type of compute, you are ready to start creating and configuring your cluster. You can create a Databricks cluster via the Databricks UI, Clusters API, Databricks CLI, SDK (Python/Java/Go/R), or Databricks Terraform provider.
If creating via the Databricks UI:
To create clusters via Clusters API, CLI, or Databricks Terraform provider, follow the instructions in the docs linked above.
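For instance, here is a minimal sketch of creating an all-purpose cluster with the Databricks SDK for Python. The cluster name, runtime version, and node type are placeholders; list the values available in your workspace with w.clusters.spark_versions() and w.clusters.list_node_types().

```python
from databricks.sdk import WorkspaceClient

# Picks up credentials from the environment, a .databrickscfg profile, or the notebook context.
w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="mlops-dev-cluster",              # placeholder name
    spark_version="14.3.x-cpu-ml-scala2.12",       # placeholder ML runtime; see w.clusters.spark_versions()
    node_type_id="i3.xlarge",                      # cloud-specific; see w.clusters.list_node_types()
    num_workers=1,
    autotermination_minutes=60,                    # auto-stop idle clusters to control cost
)
print(f"Created cluster {cluster.cluster_id}")
```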
When you create a cluster in Databricks, you must select an access mode. The access mode is a security feature that determines who can use a cluster and what data they can access via the cluster.
Available access modes for clusters can be found here.
If you are on Unity Catalog (which you should be, if you are following Databricks’ recommendation), your choice is between the “single user” and “shared” access modes. In short, a single-user cluster is assigned to and usable by exactly one user, while a shared cluster can be used by multiple users, with isolation enforced between them.
If you are starting off your MLOps work on Databricks, you will find it easier to use a single-user cluster for both interactive workloads and jobs, unless limitations prevent you from doing so. In this scenario, every user or job will have their own cluster (probably with a single node, as you will see in the cluster sizing section below) and will not interfere with other users’ workloads.
The Databricks Runtime provides a pre-configured development environment complete with the operating system, Spark components, language interpreters, and analysis tools that you will need to get started quickly.
The Runtime comes in two flavors: Standard and ML. As you are working on an ML project, you most probably need the ML version as it comes with many popular Python-based data manipulation and machine learning packages (and their dependencies) pre-installed. MLflow, which is by far the most popular MLOps framework, is also pre-installed in ML runtimes.
You cannot (currently, as of April 2024) use the ML Runtime with a cluster configured for Shared Access Mode.
Databricks regularly releases newer runtime versions with more recent versions of these libraries, and may add or remove libraries in newer releases. We recommend choosing either the latest runtime version, to get the most recent software, or the latest LTS (Long-Term Support) version.
If you need a particular version of a library, you are able to selectively install, upgrade, or downgrade on a per-library basis, but it is good practice to try to stay as close to the latest DBR or latest LTS DBR as possible.
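For example, pinning a library version in a notebook on top of the runtime is a one-liner (a sketch; the package and version are placeholders):

```python
# In one notebook cell: install the pinned version on top of the runtime's pre-installed packages.
%pip install scikit-learn==1.3.2

# In the next cell: restart the Python process so the pinned version is the one that gets imported.
dbutils.library.restartPython()

import sklearn
print(sklearn.__version__)
```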
Cluster policies are rules and definitions that your workspace administrator can impose to limit users’ ability to create compute clusters. They are primarily used as a governance measure to ensure consistency, enforce tagging, and control costs.
For example, if you wish to limit your users to creating only single-node, single-user clusters, you can do so by setting up cluster policies accordingly. You can mandate custom tagging to be applied to the cluster to facilitate the tracking of costs across projects, teams, and departments. These tags can then be used to aggregate Databricks and cloud provider costs for efficient cross-charging within large organizations.
Cluster policies also provide a reusable template for compute configurations that can be shared within and across teams. This way, you ensure that all the necessary dependencies are captured in the template and that the entire team works with the same compute configuration.
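As an illustration of the single-node, single-user example above, a policy along the following lines could enforce that shape plus a cost tag. This is a hedged sketch using the Databricks SDK for Python; the attribute paths follow the cluster policy definition format, and the policy name and tag key are just examples.

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# "fixed" attributes are enforced and (optionally) hidden from the user in the cluster form.
policy_definition = {
    "num_workers": {"type": "fixed", "value": 0, "hidden": True},
    "spark_conf.spark.databricks.cluster.profile": {"type": "fixed", "value": "singleNode", "hidden": True},
    "spark_conf.spark.master": {"type": "fixed", "value": "local[*]", "hidden": True},
    "custom_tags.ResourceClass": {"type": "fixed", "value": "SingleNode", "hidden": True},
    "data_security_mode": {"type": "fixed", "value": "SINGLE_USER"},
    "custom_tags.cost_center": {"type": "fixed", "value": "mlops-team"},   # example tag for cross-charging
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
}

policy = w.cluster_policies.create(
    name="mlops-single-node-single-user",
    definition=json.dumps(policy_definition),
)
print(f"Created policy {policy.policy_id}")
```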
To learn how to set up your cluster policies, follow the documentation here, and to learn about cluster policy best practices, refer to the documentation here.
When you are starting off your MLOps journey on Databricks, the most common scenarios are the following:
If any of these conditions apply to you, then a single-node cluster provides all the capabilities you need without incurring the high cost of a cluster with many worker nodes. Using a single-node cluster is much like using a personal VM: you can size it like a personal laptop, or pick a larger machine if you are working with GBs of data. In addition, you can read from and write to UC tables and volumes.
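For reference, the classic way to describe such a single-node, single-user cluster to the Clusters API looks roughly like this. This is a sketch with the Python SDK; the runtime and node type are placeholders, and the spark_conf/custom_tags combination is the documented single-node configuration.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

single_node = w.clusters.create_and_wait(
    cluster_name="mlops-single-node",
    spark_version="14.3.x-cpu-ml-scala2.12",                      # placeholder ML runtime
    node_type_id="i3.xlarge",                                     # placeholder; size to your data
    num_workers=0,                                                # no workers: the driver does all the work
    data_security_mode=compute.DataSecurityMode.SINGLE_USER,      # single-user access mode
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=60,
)
```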
When choosing a cluster, even a single-node one, there are multiple aspects to consider: CPU, disk, and memory are the parameters you can experiment with and make decisions on when looking for the best configuration for your workload.
The optimal number of CPU cores depends on your specific use case, workload characteristics, budget constraints, and the scalability requirements of your application.
By default, pure Python code, Pandas, or sklearn algorithms run on a single CPU core. However, there are mechanisms to leverage multiple cores for such tasks.
For example, many scikit-learn estimators accept an n_jobs parameter (backed by joblib), and joblib or Python’s multiprocessing can spread independent tasks across cores, as sketched below.
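A minimal sketch, assuming scikit-learn (pre-installed in the ML runtime); the synthetic dataset is only there to show the n_jobs knob:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for your features and labels.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)

# n_jobs=-1 lets scikit-learn (via joblib) build trees on every available CPU core.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X, y)
```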
Here is our recommendation:
A VM's disk in the cloud is primarily focused on hosting the operating system, software, and runtime environment of the virtual machine. It serves immediate, dynamic, and temporary storage needs associated with the VM's operation.
Because of this, it can be difficult to estimate how much disk you need to run your workload. Fortunately, Databricks supports autoscaling local storage, so a cluster can attach additional disk space as it runs low instead of you having to size it upfront.
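If you create clusters through the API or SDK, this option can be set as a flag on the cluster spec. A minimal sketch; the flag name reflects the Clusters API at the time of writing, and the other values are placeholders:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# enable_elastic_disk corresponds to "Enable autoscaling local storage" in the cluster UI:
# Databricks attaches additional managed disks as the node runs low on local disk.
w.clusters.create_and_wait(
    cluster_name="mlops-dev-cluster",
    spark_version="14.3.x-cpu-ml-scala2.12",   # placeholder runtime
    node_type_id="m5.xlarge",                  # placeholder node type
    num_workers=1,
    enable_elastic_disk=True,
)
```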
Selecting the right memory size for a virtual machine involves careful analysis of the operating system, applications, workload characteristics, scalability needs, and budget. When you choose a VM with a certain memory size, a good portion of that memory is allocated to the OS and the applications running on it. As a result, choosing a VM with 14GB of memory, for example, does not mean you have 14GB available to hold the data you are processing.
To determine how much memory you have available, navigate to the cluster’s page > Metrics and check out the Memory utilization graph. In the example below, you can see that out of a total of 14GB VM memory, there is about 5.28GB free after the OS takes the memory it requires.
Additionally, allocating VMs with excessive memory can be problematic due to the increased cost of garbage collection (GC) for large heaps.
Another factor to consider when choosing memory is that the size of the data on disk can differ from its size in memory. Compression, serialization, additional data structures, in-memory representation, and processing overhead all contribute to this discrepancy.
To determine what memory size you need for your data processing, read a portion of the data, for example, using Pandas, and check out the memory usage in the Metrics tab. You can then extrapolate how much data is going to be needed for the entire dataset.
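A quick way to do this estimate in code (a sketch; the path, file, and sample fraction are placeholders):

```python
import pandas as pd

# Read one file (or a sample) of the dataset from a UC volume and measure its in-memory footprint.
sample = pd.read_parquet("/Volumes/my_catalog/my_schema/my_volume/data/part-000.parquet")
sample_bytes = sample.memory_usage(deep=True).sum()

# If this sample is roughly 1% of the full dataset, extrapolate linearly. Treat the result as a
# rough estimate: compression on disk and dtype choices can move the real number noticeably.
estimated_full_gb = sample_bytes * 100 / 1e9
print(f"Sample uses {sample_bytes / 1e6:.1f} MB -> full dataset roughly {estimated_full_gb:.1f} GB in memory")
```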
Once you have identified CPU, disk, and memory requirements for your workload, you can scan through available VM types on your cloud provider and choose the best one accordingly.
In other scenarios, when you need to leverage Spark’s distribution for distributed training or inference, the question that often comes up is, “What cluster size should I choose?” On a single-node cluster, Spark can still use multiple CPUs to parallelise tasks, but as your data grows, you will be looking at sizing a multi-node cluster. The same considerations (CPU, disk, and memory) apply, now across the driver and every worker node.
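When you do go multi-node, the same creation call takes worker settings, and autoscaling lets the cluster grow and shrink with the load. A sketch with the Python SDK; node counts and types are placeholders, and the advanced article will go deeper:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

multi_node = w.clusters.create_and_wait(
    cluster_name="mlops-distributed",
    spark_version="14.3.x-cpu-ml-scala2.12",                      # placeholder runtime
    node_type_id="i3.xlarge",                                     # placeholder worker/driver node type
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),    # scale workers with the workload
    data_security_mode=compute.DataSecurityMode.SINGLE_USER,
    autotermination_minutes=60,
)
```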
Stay tuned for the upcoming "Advanced Cluster Configuration for MLOps" blog for more details.
This article is a beginner's guide to cluster configuration for MLOps using Databricks, detailing how to choose the right type of compute cluster, create clusters, manage access, set policies, size clusters, and select runtimes.
We recommend using interactive clusters for development and job clusters for automated tasks to control costs. We discuss the importance of selecting the right access mode, with options for single-user or shared clusters, and the significance of cluster policies for managing resources and costs. We also cover how to size a cluster based on CPU, disk, and memory needs, as well as the importance of selecting the right Databricks Runtime, highlighting the differences between Standard and ML runtimes and the need to stay updated with the latest versions.
Check out the official Databricks Cluster Configuration Best Practices documentation (AWS | Azure | GCP) and general Databricks Compute documentation (AWS | Azure | GCP) for further advice. Basics of Databricks Workflows - Part 3: The Compute Spectrum also provides a great overview of compute resources on Databricks with a focus on Data Engineering.
Next blog in this series: MLOps Gym - Databricks Feature Store