By Hari Selvarajan & Sourav Gulati
Welcome to the third installment of our blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Previously, we explained the basics of creating your first pipeline as well as best practices for configuring and monitoring it. In this blog, we'll focus on Databricks compute options and share guidelines for configuring your resources based on workload.
Databricks Compute refers to the set of resources, whether a single VM or an Apache Spark cluster, that Databricks configures and manages to execute a variety of tasks (e.g., Delta Live Tables, dbt, streaming, ETL, ML). There are three main configurations to choose from when defining compute resources for a specific workload: Cluster Type, Access Mode, and Databricks Runtime (DBR).
All-Purpose Compute is optimal for development, collaboration, and interactive analysis. It can be created via the user interface (UI), command-line interface (CLI), or REST API. These clusters cater to collaborative tasks such as Exploratory Data Analysis (EDA), pipeline development, and more. They can be manually terminated and restarted as needed, so multiple users can share them seamlessly. Although All-Purpose Compute supports scheduled Workflows, Databricks recommends using Jobs Compute for that purpose.
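Beyond the UI, compute can be created programmatically. Here is a minimal sketch using the Databricks SDK for Python (which wraps the REST API); the runtime version and node type are placeholder values you would adjust for your cloud and workspace.

```python
from databricks.sdk import WorkspaceClient

# Authenticates using your local Databricks config or environment variables
w = WorkspaceClient()

# Create a small All-Purpose cluster for interactive work.
# spark_version and node_type_id are illustrative; pick values valid in your workspace.
cluster = w.clusters.create(
    cluster_name="eda-interactive-cluster",
    spark_version="15.4.x-scala2.12",   # an example LTS runtime string
    node_type_id="i3.xlarge",           # example node type
    num_workers=2,
    autotermination_minutes=60,         # terminate after 1 hour of inactivity
).result()                              # wait until the cluster is RUNNING

print(f"Created cluster {cluster.cluster_id}")
```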
Databricks Job Compute is recommended for orchestrating production and repeated workloads, as it provides better resource isolation and cost benefits. The compute resources are dynamically created by the Workflow scheduler during Workflow execution and immediately terminated upon completion. Unlike All-Purpose Compute, users cannot manually restart a Job Compute resource.
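To make the difference concrete, here is a minimal sketch (again using the Databricks SDK for Python) of a job whose single task runs on an ephemeral job cluster that the scheduler creates for the run and tears down afterwards; the notebook path and node type are assumptions for illustration.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

# A job task that declares its own ephemeral job cluster (new_cluster).
# The cluster is created when the run starts and terminated when it finishes.
job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),  # hypothetical path
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",   # example LTS runtime
                node_type_id="i3.xlarge",           # example node type
                num_workers=4,
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```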
Databricks SQL (DBSQL) Warehouses run on a purpose-built engine, finely tuned for optimal SQL analytics performance. They include key performance features such as Photon (explained in detail in subsequent sections), Predictive I/O, and Intelligent Workload Management (IWM). Exclusive to DBSQL Serverless, IWM uses AI-powered prediction and dynamic resource management to allocate resources efficiently as demand changes.
When creating SQL tasks (queries, alerts, dashboards, or SQL files) in Workflows, select a SQL warehouse as your compute. This can be a Classic, Pro, or Serverless warehouse.
You can read more about DBSQL here.
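As a rough sketch of what this looks like through the Jobs API (via the Databricks SDK for Python), the example below attaches an existing saved query to a job task and points it at a SQL warehouse; the query ID and warehouse ID are placeholders for objects that already exist in your workspace.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# A SQL task in a Workflow runs on a SQL warehouse rather than a cluster.
job = w.jobs.create(
    name="daily-sales-report",
    tasks=[
        jobs.Task(
            task_key="refresh_report",
            sql_task=jobs.SqlTask(
                query=jobs.SqlTaskQuery(query_id="<saved-query-id>"),  # placeholder
                warehouse_id="<sql-warehouse-id>",                     # placeholder
            ),
        )
    ],
)
print(f"Created SQL job {job.job_id}")
```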
Delta Live Tables (DLT) pipelines operate on a dedicated, specialised cluster optimised for the requirements of DLT workloads, distinct from the more general-purpose All-Purpose or Jobs Compute. While creating a DLT pipeline, users can define the Cluster Scaling Mode (Fixed, Legacy, or Enhanced), which controls how the cluster scales in response to changes in demand. DLT manages and optimises the node type and DBR selection, ensuring the best choice of nodes and the latest DBR runtime, which reduces management overhead for users.
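The sketch below, again using the Databricks SDK for Python, shows roughly how a DLT pipeline might declare Enhanced autoscaling bounds; the notebook path is illustrative, and the exact SDK class names may differ slightly between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()

# A DLT pipeline whose default cluster uses Enhanced autoscaling between 1 and 5 workers.
# DLT itself chooses the node types and runtime version.
pipeline = w.pipelines.create(
    name="orders-dlt-pipeline",
    libraries=[pipelines.PipelineLibrary(
        notebook=pipelines.NotebookLibrary(path="/Repos/dlt/orders")  # hypothetical notebook
    )],
    clusters=[pipelines.PipelineCluster(
        label="default",
        autoscale=pipelines.PipelineClusterAutoscale(
            min_workers=1,
            max_workers=5,
            mode=pipelines.PipelineClusterAutoscaleMode.ENHANCED,
        ),
    )],
    continuous=False,  # triggered rather than continuous execution
)
print(f"Created pipeline {pipeline.pipeline_id}")
```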
Here's a brief side-by-side comparison of these compute types:
| Type | All-Purpose Cluster | Jobs Cluster | SQL Warehouse | DLT Compute |
|---|---|---|---|---|
| Persistence | Persistent cluster; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the job; terminated on completion | Persistent warehouse; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the pipeline run; terminated on completion |
| Workload | Interactive data analytics and EDA | Data Engineering, Data Science, and BI workloads | SQL queries, alerts, and dashboards (interactive and scheduled), plus dbt models | Streaming workloads |
| Use | Development and ad-hoc analysis | Production and repeated workloads | Interactive and production SQL workloads | Production and repeated workloads |
| Benefits | Collaboration with team members; ability to restart the cluster | Workload isolation and orchestrated runs | Suitable for BI and analytics workloads | Out-of-the-box features such as quality metrics, event logs, automatic restarts, autoscaling, etc. |
| Cost | Pay for usage time | Pay for usage time | Pay for usage; DBU cost differs based on the type of warehouse | Pay for usage; DBU cost differs based on Product Edition |
Serverless Compute is fully managed by Databricks, enabling rapid start-up times and automatic optimisations that adapt to your specific workloads. This means Serverless processes your data in a manner that is both cost- and performance-efficient. These benefits translate to a lower TCO, better reliability, and an improved user experience.
Serverless compute is an option currently available for DBSQL warehouses, Workflows, and DLT pipelines.
For Serverless Workflows and DLT, Databricks chooses the best compute configuration (based on runtime, nodes, and size) and delivers optimal workload execution. Please reach out to the Databricks account team if you would like to use this feature.
Compute Access Modes define the permissions and restrictions for cluster usage and data access. The access mode also determines whether your cluster is enabled for governance features like Unity Catalog. Databricks clusters offer the following access modes: Single User, Shared, and No Isolation Shared.
The Single User access mode is used to run a Workflow under the ownership of a single user. When a Workflow is executed on a Single User Access Mode cluster, it is executed under the identity of the assigned user/service principal. For production jobs, it is recommended to run the job as a service principal.
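As a rough illustration (not the only way to do this), the sketch below uses the Databricks SDK for Python to create a job that runs as a service principal; the application ID, notebook path, and cluster ID are placeholders.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Run the job under a service principal's identity rather than a personal user account.
job = w.jobs.create(
    name="prod-scoring-job",
    run_as=jobs.JobRunAs(service_principal_name="<service-principal-application-id>"),
    tasks=[
        jobs.Task(
            task_key="score",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/prod/score"),  # hypothetical path
            existing_cluster_id="<single-user-cluster-id>",  # a Single User access mode cluster
        )
    ],
)
print(f"Created job {job.job_id} running as a service principal")
```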
Some recommended use cases for a Single User access mode cluster include workloads that need capabilities not yet supported in Shared access mode, such as R, distributed ML, GPU compute, or the RDD API (see the comparison table below).
The Shared access mode enables multiple users to leverage the same compute concurrently, fostering a collaborative environment. While users can run their workloads concurrently, Shared access mode preserves user isolation and improves security.
Whilst Databricks recommends using Shared access mode for most workloads, there are some exceptions: the Databricks Runtime for ML and the Spark Machine Learning Library (MLlib) are not yet supported.
For a complete list of limitations of Single-user and Shared access modes, refer to the official documentation.
The No Isolation Shared access mode does not support Unity Catalog. While it allows multiple users to share a cluster, it does not provide user isolation and is not recommended for working with sensitive data.
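Access mode is chosen at cluster creation time. The hedged sketch below shows how this might look with the Databricks SDK for Python via the `data_security_mode` field; the runtime and node type values are illustrative.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Create a Unity Catalog-enabled cluster in Shared access mode (USER_ISOLATION).
# Other options include SINGLE_USER and NONE (No Isolation Shared).
shared_cluster = w.clusters.create(
    cluster_name="shared-uc-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    num_workers=2,
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,  # Shared access mode
    autotermination_minutes=60,
).result()

print(f"Cluster {shared_cluster.cluster_id} created with Shared access mode")
```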
Here's a brief side-by-side analysis of access modes:
| Access Mode | Single User | Shared | No Isolation Shared |
|---|---|---|---|
| Permission required to use | Only the assigned user can use | CAN ATTACH | CAN ATTACH |
| Unity Catalog Support | | | |
| Access data in Unity Catalog | Yes | Yes | No |
| Fine-grained access control (views, row/column masking) | Will be available soon | Yes | No |
| Languages and APIs | | | |
| Python/SQL | Yes | Yes | Yes |
| Scala | Yes | In Preview (at the time of writing) | Yes |
| DataFrame API, Streaming API, single-node ML | Yes | In Preview (at the time of writing) | Yes |
| R, distributed ML, GPU, RDD API | Yes | No | Yes |
| Other Features | | | |
| Init scripts / cluster libraries | Yes | In Preview (at the time of writing) | Yes |
Databricks Runtime encompasses the collection of software components deployed on Databricks Compute. Along with Apache Spark™, it includes various libraries (e.g., Delta, MLflow) and components that substantially improve the usability, performance, and security of running workloads. Depending on the type of workload, different Databricks Runtime variants are available. The general recommendation is to always use the latest LTS (Long Term Support) version of each runtime, as it comes with the latest improvements and 3 years of support.
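If you create compute programmatically, the Databricks SDK for Python offers a small helper for picking the runtime. The sketch below, which assumes the `select_spark_version` helper behaves as in the SDK's published examples, resolves the latest LTS version string rather than hard-coding one.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Resolve the latest Long Term Support (LTS) Databricks Runtime version string
# instead of hard-coding it, so new clusters pick up newer LTS releases over time.
latest_lts = w.clusters.select_spark_version(latest=True, long_term_support=True)
print(f"Latest LTS runtime: {latest_lts}")
```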
The Standard Runtime is ideal for diverse data engineering and streaming workloads, featuring Apache Spark and key components such as Delta Lake and pandas. It also comes pre-installed with Python, R, and Scala libraries, providing a comprehensive environment for efficient data processing and analytics.
In addition to the libraries included in the Standard Runtime, the ML Runtime adds popular machine learning libraries such as MLflow, TensorFlow, Keras, PyTorch, and XGBoost. There is also the option to choose a GPU-enabled runtime, specifically for deep learning and generative AI workloads.
Photon is a vectorised execution engine, written from the ground up in C++, that is compatible with Apache Spark APIs. It provides many-fold performance improvements (up to 80% TCO savings over traditional DBR and up to 85% reduction in VM compute hours) over the standard Spark engine. It is available only with the Standard Databricks Runtime and is most beneficial for batch-processing workloads. Photon-enabled compute consumes DBUs at a higher rate, and Photon is enabled by default on all DBSQL, DLT, and Serverless compute.
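For classic clusters you opt into Photon explicitly. Below is a hedged sketch with the Databricks SDK for Python, assuming illustrative runtime and node type values.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Enable Photon on a classic cluster by selecting the PHOTON runtime engine.
photon_cluster = w.clusters.create(
    cluster_name="photon-batch-cluster",
    spark_version="15.4.x-scala2.12",             # example LTS runtime
    node_type_id="i3.xlarge",                     # example node type
    num_workers=4,
    runtime_engine=compute.RuntimeEngine.PHOTON,  # Photon instead of the standard engine
    autotermination_minutes=60,
).result()

print(f"Photon cluster {photon_cluster.cluster_id} is running")
```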
Beyond what has been discussed so far, users have additional options to optimise and configure compute resources.
Cluster Policies enforce or suggest configurations when a cluster is created, helping admins control costs, standardise configurations, and simplify the cluster creation experience for users. As there are many options to choose from when creating a cluster, admins can use policies to limit user choices; policies can cover DBR selection, node selection, maximum worker count, specific Spark properties (e.g., an external Hive metastore), and more.
Here’s an example of what a cluster policy might look like:
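The following sketch defines a simple policy as a Python dictionary and registers it with the Databricks SDK for Python; the allowed node types, worker cap, and tag value are illustrative choices, not a recommendation.

```python
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# A simple policy definition: pin the runtime to LTS releases, restrict node types,
# cap cluster size, fix auto-termination, and require a cost-attribution tag.
policy_definition = {
    "spark_version": {"type": "allowlist", "values": ["auto:latest-lts", "auto:latest-lts-ml"]},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},  # example node types
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
    "custom_tags.team": {"type": "fixed", "value": "data-engineering"},
}

policy = w.cluster_policies.create(
    name="data-engineering-policy",
    definition=json.dumps(policy_definition),
)
print(f"Created policy {policy.policy_id}")
```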
You can also refer to this blog for more details on how cluster policies can help manage your compute costs.
A compute pool is a set of readily available, idle instances designed to reduce start-up and autoscaling delays. Users can create a pool and reference it when launching a cluster; when the cluster terminates, its instances are returned to the pool for future use. While cluster pools are beneficial for running workloads on classic compute, Databricks recommends using Serverless where possible.
Key settings and metrics for a cluster pool include the total number of instances, the minimum number of idle instances kept warm, the pool's maximum capacity, and the number of instances currently in use.
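A minimal sketch of creating such a pool with the Databricks SDK for Python, with illustrative node type and sizing values:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Keep 2 warm instances idle for fast cluster start-up, cap the pool at 10 instances,
# and release idle instances after 30 minutes to control cost. Values are illustrative.
pool = w.instance_pools.create(
    instance_pool_name="etl-warm-pool",
    node_type_id="i3.xlarge",   # example node type
    min_idle_instances=2,
    max_capacity=10,
    idle_instance_autotermination_minutes=30,
)
print(f"Created instance pool {pool.instance_pool_id}")
```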
Some of the benefits of using pools include faster cluster start-up and autoscaling, since idle instances are already provisioned, and the ability to reuse warm instances across clusters.
Databricks Autoscaling is designed to maximise cluster efficiency by dynamically allocating resources in response to workload fluctuations. This setting is most suitable for workloads with varying compute requirements. Users provide a minimum and maximum number of worker nodes, and resources are allocated based on workload requirements. For example, a cluster might be configured with a minimum of 2 and a maximum of 8 worker nodes:
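Here is a hedged sketch of that configuration with the Databricks SDK for Python (runtime and node type are illustrative):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# An autoscaling cluster that grows from 2 to 8 workers as the workload demands.
autoscaling_cluster = w.clusters.create(
    cluster_name="autoscaling-etl-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),
    autotermination_minutes=60,
).result()

print(f"Autoscaling cluster {autoscaling_cluster.cluster_id} is running")
```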
For DLT pipelines, there is also an Enhanced Autoscaling mode available, which adds autoscaling support for streaming workloads along with additional enhancements to improve the performance of batch workloads.
Spot instances help reduce compute costs for non-mission-critical workloads. Using spot pricing for cloud resources, users can access unused capacity at deep discounts. Databricks automatically handles spot VM evictions by starting new pay-as-you-go (on-demand) worker nodes, helping jobs run reliably to completion. This provides predictability while helping to lower costs.
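As an AWS-flavoured sketch with the Databricks SDK for Python (Azure and GCP have analogous attribute blocks), the example below keeps the driver on an on-demand instance and lets workers fall back from spot to on-demand when spot capacity is unavailable; all values are illustrative.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Workers use spot capacity with fallback to on-demand; the first node (driver) stays on-demand.
spot_cluster = w.clusters.create(
    cluster_name="spot-batch-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    num_workers=6,
    aws_attributes=compute.AwsAttributes(
        availability=compute.AwsAvailability.SPOT_WITH_FALLBACK,
        first_on_demand=1,              # keep the driver on an on-demand instance
    ),
    autotermination_minutes=60,
).result()

print(f"Spot-backed cluster {spot_cluster.cluster_id} is running")
```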
In addition to the previously mentioned configurations, users can specify a number of further options when defining compute.
Databricks Compute offers powerful and flexible solutions for handling complex data processing and analytics tasks. With the ability to efficiently and automatically scale, and different configuration options available for each type of workload, Databricks provides a robust framework for optimising computational resources. Whether you're dealing with large-scale data processing or dynamic workloads, Databricks Compute empowers data teams to derive meaningful insights and drive innovation in a data-driven landscape.