When setting up compute, there are many options and knobs to tweak and tune, and it can quickly become overwhelming. To help you configure your clusters optimally, we have broken this topic down into two articles: a beginner's guide and an advanced guide.
In this article, we cover the first stage, the beginner's guide.
Compute, or a compute cluster, refers to a computation resource built on one or more cloud virtual machines (VMs, also referred to as nodes) that you can use to accomplish your data engineering, data science, or AI/ML tasks.
Before we begin the cluster creation and configuration process, we need to understand the use case the compute will serve, as this determines the type of cluster we need.
There are several types of compute available on Databricks:
In addition, Databricks offers ML-specialized compute types that are tailored for specific use cases (out of the scope of this Beginners’ Guide):
Here is our recommendation on when to use each type:
Once you have determined the type of compute, you are ready to start creating and configuring your cluster. You can create a Databricks cluster via the Databricks UI, Clusters API, Databricks CLI, SDK (Python/Java/Go/R), or Databricks Terraform provider.
If creating via the Databricks UI:
To create clusters via Clusters API, CLI, or Databricks Terraform provider, follow the instructions in the docs linked above.
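For instance, here is a minimal sketch of creating an all-purpose cluster with the Databricks SDK for Python. The cluster name, runtime version, and node type are placeholders; list the values available in your workspace with w.clusters.spark_versions() and w.clusters.list_node_types().

```python
from databricks.sdk import WorkspaceClient

# Picks up credentials from the environment, a .databrickscfg profile, or the notebook context.
w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="mlops-dev-cluster",              # placeholder name
    spark_version="14.3.x-cpu-ml-scala2.12",       # placeholder ML runtime; see w.clusters.spark_versions()
    node_type_id="i3.xlarge",                      # cloud-specific; see w.clusters.list_node_types()
    num_workers=1,
    autotermination_minutes=60,                    # auto-stop idle clusters to control cost
)
print(f"Created cluster {cluster.cluster_id}")
```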
When you create a cluster in Databricks, you must select an access mode. The access mode is a security feature that determines who can use a cluster and what data they can access via the cluster.
Available access modes for clusters can be found here.
If you are on Unity Catalog (which you should be, if you are following Databricks’ recommendation), your choice is between the “single user” and “shared” access modes. In short, a single-user cluster is assigned to and usable by exactly one user, while a shared cluster can be used by multiple users, with isolation enforced between them.
If you are starting off your MLOps work on Databricks, you will find it easier to use a single-user cluster for both interactive workloads and jobs, unless limitations prevent you from doing so. In this scenario, every user or job will have their own cluster (probably with a single node, as you will see in the cluster sizing section below) and will not interfere with other users’ workloads.
The Databricks Runtime provides a pre-configured development environment complete with the operating system, Spark components, language interpreters, and analysis tools that you will need to get started quickly.
The Runtime comes in two flavors: Standard and ML. As you are working on an ML project, you most probably need the ML version as it comes with many popular Python-based data manipulation and machine learning packages (and their dependencies) pre-installed. MLflow, which is by far the most popular MLOps framework, is also pre-installed in ML runtimes.
You cannot (currently, as of April 2024) use the ML Runtime with a cluster configured for Shared Access Mode.
Databricks regularly releases newer runtime versions with more recent versions of these libraries, and may add or remove libraries in newer releases. We recommend choosing either the latest runtime version, to get the most recent software, or the latest LTS (Long-Term Support) version.
If you need a particular version of a library, you are able to selectively install, upgrade, or downgrade on a per-library basis, but it is good practice to try to stay as close to the latest DBR or latest LTS DBR as possible.
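For example, pinning a library version in a notebook on top of the runtime is a one-liner (a sketch; the package and version are placeholders):

```python
# In one notebook cell: install the pinned version on top of the runtime's pre-installed packages.
%pip install scikit-learn==1.3.2

# In the next cell: restart the Python process so the pinned version is the one that gets imported.
dbutils.library.restartPython()

import sklearn
print(sklearn.__version__)
```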
Cluster policies are rules and definitions that your workspace administrator can impose to limit users’ ability to create compute clusters. They are primarily used as a governance measure to ensure consistency, enforce tagging, and control costs.
For example, if you wish to limit your users to creating only single-node, single-user clusters, you can do so by setting up cluster policies accordingly. You can mandate custom tagging to be applied to the cluster to facilitate the tracking of costs across projects, teams, and departments. These tags can then be used to aggregate Databricks and cloud provider costs for efficient cross-charging within large organizations.
Cluster policies also provide a reusable template for compute configurations that can be shared within and across teams. This way, you ensure that all the necessary dependencies are captured in the template and that the entire team works with the same compute configuration.
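As an illustration of the single-node, single-user example above, a policy along the following lines could enforce that shape plus a cost tag. This is a hedged sketch using the Databricks SDK for Python; the attribute paths follow the cluster policy definition format, and the policy name and tag key are just examples.

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# "fixed" attributes are enforced and (optionally) hidden from the user in the cluster form.
policy_definition = {
    "num_workers": {"type": "fixed", "value": 0, "hidden": True},
    "spark_conf.spark.databricks.cluster.profile": {"type": "fixed", "value": "singleNode", "hidden": True},
    "spark_conf.spark.master": {"type": "fixed", "value": "local[*]", "hidden": True},
    "custom_tags.ResourceClass": {"type": "fixed", "value": "SingleNode", "hidden": True},
    "data_security_mode": {"type": "fixed", "value": "SINGLE_USER"},
    "custom_tags.cost_center": {"type": "fixed", "value": "mlops-team"},   # example tag for cross-charging
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
}

policy = w.cluster_policies.create(
    name="mlops-single-node-single-user",
    definition=json.dumps(policy_definition),
)
print(f"Created policy {policy.policy_id}")
```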
To learn how to set up your cluster policies, follow the documentation here, and to learn about cluster policy best practices, refer to the documentation here.
When you are starting off your MLOps journey on Databricks, the most common scenarios are the following:
If any of these conditions apply to you, then a single-node cluster provides all the capabilities you need without incurring the high cost of a cluster with many worker nodes. Using a single-node cluster is much like using a personal VM: you can size it like a personal laptop, or pick a larger machine if you are working with GBs of data. In addition, you can read from and write to UC tables and volumes.
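For reference, the classic way to describe such a single-node, single-user cluster to the Clusters API looks roughly like this. This is a sketch with the Python SDK; the runtime and node type are placeholders, and the spark_conf/custom_tags combination is the documented single-node configuration.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

single_node = w.clusters.create_and_wait(
    cluster_name="mlops-single-node",
    spark_version="14.3.x-cpu-ml-scala2.12",                      # placeholder ML runtime
    node_type_id="i3.xlarge",                                     # placeholder; size to your data
    num_workers=0,                                                # no workers: the driver does all the work
    data_security_mode=compute.DataSecurityMode.SINGLE_USER,      # single-user access mode
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=60,
)
```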
When choosing a cluster, even a single-node one, there are multiple aspects to consider: CPU, disk, and memory are the parameters you can experiment with and make decisions on when looking for the best configuration for your workload.
The optimal number of CPU cores depends on your specific use case, workload characteristics, budget constraints, and the scalability requirements of your application.
By default, pure Python code, Pandas, or sklearn algorithms run on a single CPU core. However, there are mechanisms to leverage multiple cores for such tasks.
For example, many scikit-learn estimators accept an n_jobs parameter (backed by joblib), and joblib or Python’s multiprocessing can spread independent tasks across cores, as sketched below.
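A minimal sketch, assuming scikit-learn (pre-installed in the ML runtime); the synthetic dataset is only there to show the n_jobs knob:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for your features and labels.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)

# n_jobs=-1 lets scikit-learn (via joblib) build trees on every available CPU core.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X, y)
```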
Here is our recommendation:
A VM's disk in the cloud is primarily focused on hosting the operating system, software, and runtime environment of the virtual machine. It serves immediate, dynamic, and temporary storage needs associated with the VM's operation.
Because of this, it can be difficult to estimate how much disk you need to run your workload. Fortunately, Databricks supports autoscaling local storage, so a cluster can attach additional disk space as it runs low instead of you having to size it upfront.
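If you create clusters through the API or SDK, this option can be set as a flag on the cluster spec. A minimal sketch; the flag name reflects the Clusters API at the time of writing, and the other values are placeholders:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# enable_elastic_disk corresponds to "Enable autoscaling local storage" in the cluster UI:
# Databricks attaches additional managed disks as the node runs low on local disk.
w.clusters.create_and_wait(
    cluster_name="mlops-dev-cluster",
    spark_version="14.3.x-cpu-ml-scala2.12",   # placeholder runtime
    node_type_id="m5.xlarge",                  # placeholder node type
    num_workers=1,
    enable_elastic_disk=True,
)
```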
Selecting the right memory size for a virtual machine involves careful analysis of the operating system, applications, workload characteristics, scalability needs, and budget. When you choose a VM with a certain memory size, a good portion of that memory is allocated to the OS and the applications running on it. As a result, choosing a VM with 14GB of memory, for example, does not mean you have 14GB available to hold the data you are processing.
To determine how much memory you have available, navigate to the cluster’s page > Metrics and check out the Memory utilization graph. In the example below, you can see that out of a total of 14GB VM memory, there is about 5.28GB free after the OS takes the memory it requires.
Additionally, allocating VMs with excessive memory can be problematic due to the increased cost of garbage collection (GC) for large heaps.
Another factor to consider when choosing memory is that the size of the data on disk can differ from its size in memory. Compression, serialization, additional data structures, in-memory representation, and processing overhead all contribute to this discrepancy.
To determine what memory size you need for your data processing, read a portion of the data, for example, using Pandas, and check out the memory usage in the Metrics tab. You can then extrapolate how much data is going to be needed for the entire dataset.
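A quick way to do this estimate in code (a sketch; the path, file, and sample fraction are placeholders):

```python
import pandas as pd

# Read one file (or a sample) of the dataset from a UC volume and measure its in-memory footprint.
sample = pd.read_parquet("/Volumes/my_catalog/my_schema/my_volume/data/part-000.parquet")
sample_bytes = sample.memory_usage(deep=True).sum()

# If this sample is roughly 1% of the full dataset, extrapolate linearly. Treat the result as a
# rough estimate: compression on disk and dtype choices can move the real number noticeably.
estimated_full_gb = sample_bytes * 100 / 1e9
print(f"Sample uses {sample_bytes / 1e6:.1f} MB -> full dataset roughly {estimated_full_gb:.1f} GB in memory")
```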
Once you have identified CPU, disk, and memory requirements for your workload, you can scan through available VM types on your cloud provider and choose the best one accordingly.
In other scenarios, when you need to leverage Spark’s distribution for distributed training or inference, the question that often comes up is, “What cluster size should I choose?” On a single-node cluster, Spark can still use multiple CPUs to parallelise tasks, but as your data grows, you will be looking at sizing a multi-node cluster. The same considerations (CPU, disk, and memory) apply, now across the driver and every worker node.
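When you do go multi-node, the same creation call takes worker settings, and autoscaling lets the cluster grow and shrink with the load. A sketch with the Python SDK; node counts and types are placeholders, and the advanced article will go deeper:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

multi_node = w.clusters.create_and_wait(
    cluster_name="mlops-distributed",
    spark_version="14.3.x-cpu-ml-scala2.12",                      # placeholder runtime
    node_type_id="i3.xlarge",                                     # placeholder worker/driver node type
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),    # scale workers with the workload
    data_security_mode=compute.DataSecurityMode.SINGLE_USER,
    autotermination_minutes=60,
)
```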
Stay tuned for the upcoming "Advanced Cluster Configuration for MLOps" blog for more details.
This article is a beginner's guide to cluster configuration for MLOps using Databricks, detailing how to choose the right type of compute cluster, create clusters, manage access, set policies, size clusters, and select runtimes.
We recommend using interactive clusters for development and job clusters for automated tasks to control costs. We discuss the importance of selecting the right access mode, with options for single-user or shared clusters, and the significance of cluster policies for managing resources and costs. We also cover how to size a cluster based on CPU, disk, and memory needs, as well as the importance of selecting the right Databricks Runtime, highlighting the differences between Standard and ML runtimes and the need to stay updated with the latest versions.
Check out the official Databricks Cluster Configuration Best Practices documentation (AWS | Azure | GCP) and general Databricks Compute documentation (AWS | Azure | GCP) for further advice. Basics of Databricks Workflows - Part 3: The Compute Spectrum also provides a great overview of compute resources on Databricks with a focus on Data Engineering.
Next blog in this series: MLOps Gym - Databricks Feature Store