By Hari Selvarajan & Sourav Gulati
Welcome to the third installment of our blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Previously, we explained the basics of creating your first pipeline as well as best practices for configuring and monitoring it. In this blog, we'll focus on Databricks compute options and share guidelines for configuring your resources based on workload.
Databricks Compute refers to the set of resources, whether a single VM or an Apache Spark cluster, that Databricks configures and manages to execute a variety of tasks (e.g., Delta Live Tables, dbt, streaming, ETL, ML). There are three main configurations to choose from when defining compute resources for a specific workload: Cluster Type, Access Mode, and Databricks Runtime (DBR).
All-Purpose Compute is optimal for development, collaboration, and interactive analysis. It can be created via the user interface (UI), command-line interface (CLI), or REST API. These clusters cater to collaborative tasks such as Exploratory Data Analysis (EDA), pipeline development, and more. They can be manually terminated and restarted as needed, so multiple users can share them seamlessly. Although All-Purpose Compute supports scheduled Workflows, Databricks recommends using Jobs Compute for that purpose.
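Beyond the UI, compute can be created programmatically. Here is a minimal sketch using the Databricks SDK for Python (which wraps the REST API); the runtime version and node type are placeholder values you would adjust for your cloud and workspace.

```python
from databricks.sdk import WorkspaceClient

# Authenticates using your local Databricks config or environment variables
w = WorkspaceClient()

# Create a small All-Purpose cluster for interactive work.
# spark_version and node_type_id are illustrative; pick values valid in your workspace.
cluster = w.clusters.create(
    cluster_name="eda-interactive-cluster",
    spark_version="15.4.x-scala2.12",   # an example LTS runtime string
    node_type_id="i3.xlarge",           # example node type
    num_workers=2,
    autotermination_minutes=60,         # terminate after 1 hour of inactivity
).result()                              # wait until the cluster is RUNNING

print(f"Created cluster {cluster.cluster_id}")
```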
Databricks Job Compute is recommended for orchestrating production and repeated workloads, as it provides better resource isolation and cost benefits. The compute resources are dynamically created by the Workflow scheduler during Workflow execution and immediately terminated upon completion. Unlike All-Purpose Compute, users cannot manually restart a Job Compute resource.
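To make the difference concrete, here is a minimal sketch (again using the Databricks SDK for Python) of a job whose single task runs on an ephemeral job cluster that the scheduler creates for the run and tears down afterwards; the notebook path and node type are assumptions for illustration.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

# A job task that declares its own ephemeral job cluster (new_cluster).
# The cluster is created when the run starts and terminated when it finishes.
job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),  # hypothetical path
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",   # example LTS runtime
                node_type_id="i3.xlarge",           # example node type
                num_workers=4,
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```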
Databricks SQL (DBSQL) Warehouses run on a purpose-built engine, finely tuned for optimal SQL analytics performance. They include key performance features such as Photon (explained in detail in subsequent sections), Predictive I/O, and Intelligent Workload Management (IWM). Exclusive to DBSQL Serverless, IWM uses AI-powered prediction and dynamic resource management to allocate resources efficiently as demand changes.
When creating SQL tasks (queries, alerts, dashboards, or SQL files) in Workflows, select a SQL warehouse as your compute. This can be a Classic, Pro, or Serverless warehouse.
You can read more about DBSQL here.
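As a rough sketch of what this looks like through the Jobs API (via the Databricks SDK for Python), the example below attaches an existing saved query to a job task and points it at a SQL warehouse; the query ID and warehouse ID are placeholders for objects that already exist in your workspace.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# A SQL task in a Workflow runs on a SQL warehouse rather than a cluster.
job = w.jobs.create(
    name="daily-sales-report",
    tasks=[
        jobs.Task(
            task_key="refresh_report",
            sql_task=jobs.SqlTask(
                query=jobs.SqlTaskQuery(query_id="<saved-query-id>"),  # placeholder
                warehouse_id="<sql-warehouse-id>",                     # placeholder
            ),
        )
    ],
)
print(f"Created SQL job {job.job_id}")
```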
Delta Live Tables (DLT) pipelines operate on a dedicated, specialised cluster optimised for the requirements of DLT workloads, distinct from the more general-purpose All-Purpose or Jobs Compute. While creating a DLT pipeline, users can define the Cluster Scaling Mode (Fixed, Legacy, or Enhanced), which controls how the cluster scales in response to changes in demand. DLT manages and optimises the node type and DBR selection, ensuring the best choice of nodes and the latest DBR runtime, which reduces management overhead for users.
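The sketch below, again using the Databricks SDK for Python, shows roughly how a DLT pipeline might declare Enhanced autoscaling bounds; the notebook path is illustrative, and the exact SDK class names may differ slightly between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()

# A DLT pipeline whose default cluster uses Enhanced autoscaling between 1 and 5 workers.
# DLT itself chooses the node types and runtime version.
pipeline = w.pipelines.create(
    name="orders-dlt-pipeline",
    libraries=[pipelines.PipelineLibrary(
        notebook=pipelines.NotebookLibrary(path="/Repos/dlt/orders")  # hypothetical notebook
    )],
    clusters=[pipelines.PipelineCluster(
        label="default",
        autoscale=pipelines.PipelineClusterAutoscale(
            min_workers=1,
            max_workers=5,
            mode=pipelines.PipelineClusterAutoscaleMode.ENHANCED,
        ),
    )],
    continuous=False,  # triggered rather than continuous execution
)
print(f"Created pipeline {pipeline.pipeline_id}")
```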
Here's a brief side-by-side comparison of these compute types:
| Type | All-Purpose Cluster | Jobs Cluster | SQL Warehouse | DLT Compute |
|---|---|---|---|---|
| Persistence | Persistent cluster; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the job; terminated on completion | Persistent warehouse; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the pipeline run; terminated on completion |
| Workload | Interactive data analytics and EDA | Data Engineering, Data Science, and BI workloads | SQL queries, alerts, and dashboards (interactive and scheduled), plus dbt models | Streaming workloads |
| Use | Development and ad-hoc analysis | Production and repeated workloads | Interactive and production SQL workloads | Production and repeated workloads |
| Benefits | Collaboration with team members; ability to restart the cluster | Workload isolation and orchestrated runs | Suitable for BI and analytics workloads | Out-of-the-box features such as quality metrics, event logs, automatic restarts, autoscaling, etc. |
| Cost | Pay for usage time | Pay for usage time | Pay for usage; DBU cost differs based on the type of warehouse | Pay for usage; DBU cost differs based on Product Edition |
Serverless Compute is fully managed by Databricks, enabling rapid start-up times and automatic optimisations that adapt to your specific workloads. This means Serverless processes your data in a manner that is both cost- and performance-efficient. These benefits translate to a lower TCO, better reliability, and an improved user experience.
Serverless compute is an option currently available for DBSQL warehouses, Workflows, and DLT pipelines.
For Serverless Workflows and DLT, Databricks chooses the best compute configuration (based on runtime, nodes, and size) and delivers optimal workload execution. Please reach out to the Databricks account team if you would like to use this feature.
Compute Access Modes define the permissions and restrictions for cluster usage and data access. The access mode also determines whether your cluster is enabled for governance features like Unity Catalog. Databricks clusters offer the following access modes: Single User, Shared, and No Isolation Shared.
The Single User access mode is used to run a Workflow under the ownership of a single user. When a Workflow is executed on a Single User Access Mode cluster, it is executed under the identity of the assigned user/service principal. For production jobs, it is recommended to run the job as a service principal.
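As a rough illustration (not the only way to do this), the sketch below uses the Databricks SDK for Python to create a job that runs as a service principal; the application ID, notebook path, and cluster ID are placeholders.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Run the job under a service principal's identity rather than a personal user account.
job = w.jobs.create(
    name="prod-scoring-job",
    run_as=jobs.JobRunAs(service_principal_name="<service-principal-application-id>"),
    tasks=[
        jobs.Task(
            task_key="score",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/prod/score"),  # hypothetical path
            existing_cluster_id="<single-user-cluster-id>",  # a Single User access mode cluster
        )
    ],
)
print(f"Created job {job.job_id} running as a service principal")
```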
Some recommended use cases for a Single User access mode cluster include workloads that need capabilities not yet supported in Shared access mode, such as R, distributed ML, GPU compute, or the RDD API (see the comparison table below).
The Shared access mode enables multiple users to leverage the same compute concurrently, fostering a collaborative environment. While users can run their workloads concurrently, Shared access mode preserves user isolation and improves security.
Whilst Databricks recommends using Shared access mode for most workloads, there are some exceptions: the Databricks Runtime for ML and the Spark Machine Learning Library (MLlib) are not yet supported.
For a complete list of limitations of Single-user and Shared access modes, refer to the official documentation.
The No Isolation Shared access mode does not support Unity Catalog. While it allows multiple users to share a cluster, it does not provide user isolation and is not recommended for working with sensitive data.
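Access mode is chosen at cluster creation time. The hedged sketch below shows how this might look with the Databricks SDK for Python via the `data_security_mode` field; the runtime and node type values are illustrative.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Create a Unity Catalog-enabled cluster in Shared access mode (USER_ISOLATION).
# Other options include SINGLE_USER and NONE (No Isolation Shared).
shared_cluster = w.clusters.create(
    cluster_name="shared-uc-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    num_workers=2,
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,  # Shared access mode
    autotermination_minutes=60,
).result()

print(f"Cluster {shared_cluster.cluster_id} created with Shared access mode")
```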
Here's a brief side-by-side analysis of access modes:
| Access Mode | Single User | Shared | No Isolation Shared |
|---|---|---|---|
| Permission required to use | Only the assigned user can use | CAN ATTACH | CAN ATTACH |
| Unity Catalog Support | | | |
| Access data in Unity Catalog | Yes | Yes | No |
| Fine-grained access control (views, row/column masking) | Will be available soon | Yes | No |
| Languages and APIs | | | |
| Python/SQL | Yes | Yes | Yes |
| Scala | Yes | In Preview (at the time of writing) | Yes |
| DataFrame API, Streaming API, single-node ML | Yes | In Preview (at the time of writing) | Yes |
| R, distributed ML, GPU, RDD API | Yes | No | Yes |
| Other Features | | | |
| Init scripts / cluster libraries | Yes | In Preview (at the time of writing) | Yes |
Databricks Runtime encompasses the collection of software components deployed on Databricks Compute. Along with Apache Spark™, it includes various libraries (e.g., Delta, MLflow) and components that substantially improve the usability, performance, and security of running workloads. Depending on the type of workload, different Databricks Runtime variants are available. The general recommendation is to always use the latest LTS (Long Term Support) version of each runtime, as it comes with the latest improvements and 3 years of support.
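If you create compute programmatically, the Databricks SDK for Python offers a small helper for picking the runtime. The sketch below, which assumes the `select_spark_version` helper behaves as in the SDK's published examples, resolves the latest LTS version string rather than hard-coding one.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Resolve the latest Long Term Support (LTS) Databricks Runtime version string
# instead of hard-coding it, so new clusters pick up newer LTS releases over time.
latest_lts = w.clusters.select_spark_version(latest=True, long_term_support=True)
print(f"Latest LTS runtime: {latest_lts}")
```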
The Standard Runtime is ideal for diverse data engineering and streaming workloads, featuring Apache Spark and key components such as Delta Lake and pandas. It also comes pre-installed with Python, R, and Scala libraries, providing a comprehensive environment for efficient data processing and analytics.
In addition to the libraries included in the Standard Runtime, the ML Runtime adds popular machine learning libraries such as MLflow, TensorFlow, Keras, PyTorch, and XGBoost. There is also the option to choose a GPU-enabled runtime, specifically for deep learning and generative AI workloads.
Photon is a vectorised execution engine, written from the ground up in C++, that is compatible with Apache Spark APIs. It provides many-fold performance improvements (up to 80% TCO savings over traditional DBR and up to 85% reduction in VM compute hours) over the standard Spark engine. It is available only with the Standard Databricks Runtime and is most beneficial for batch-processing workloads. Photon-enabled compute consumes DBUs at a higher rate, and Photon is enabled by default on all DBSQL, DLT, and Serverless compute.
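For classic clusters you opt into Photon explicitly. Below is a hedged sketch with the Databricks SDK for Python, assuming illustrative runtime and node type values.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Enable Photon on a classic cluster by selecting the PHOTON runtime engine.
photon_cluster = w.clusters.create(
    cluster_name="photon-batch-cluster",
    spark_version="15.4.x-scala2.12",             # example LTS runtime
    node_type_id="i3.xlarge",                     # example node type
    num_workers=4,
    runtime_engine=compute.RuntimeEngine.PHOTON,  # Photon instead of the standard engine
    autotermination_minutes=60,
).result()

print(f"Photon cluster {photon_cluster.cluster_id} is running")
```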
Beyond what has been discussed so far, users have additional options to optimise and configure compute resources.
Cluster Policies enforce or suggest configurations when a cluster is created, helping admins control costs, standardise configurations, and simplify the cluster creation experience for users. As there are many options to choose from when creating a cluster, admins can use policies to limit user choices; policies can cover DBR selection, node selection, maximum worker count, specific Spark properties (e.g., an external Hive metastore), and more.
Here’s an example of what a cluster policy might look like:
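The following sketch defines a simple policy as a Python dictionary and registers it with the Databricks SDK for Python; the allowed node types, worker cap, and tag value are illustrative choices, not a recommendation.

```python
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# A simple policy definition: pin the runtime to LTS releases, restrict node types,
# cap cluster size, fix auto-termination, and require a cost-attribution tag.
policy_definition = {
    "spark_version": {"type": "allowlist", "values": ["auto:latest-lts", "auto:latest-lts-ml"]},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},  # example node types
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
    "custom_tags.team": {"type": "fixed", "value": "data-engineering"},
}

policy = w.cluster_policies.create(
    name="data-engineering-policy",
    definition=json.dumps(policy_definition),
)
print(f"Created policy {policy.policy_id}")
```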
You can also refer to this blog for more details on how cluster policies can help manage your compute costs.
A compute pool is a set of readily available, idle instances designed to reduce start-up and autoscaling delays. Users can create a pool and reference it when launching a cluster; when the cluster terminates, its instances are returned to the pool for future use. While cluster pools are beneficial for running workloads on classic compute, Databricks recommends using Serverless where possible.
Key settings and metrics for a cluster pool include the total number of instances, the minimum number of idle instances kept warm, the pool's maximum capacity, and the number of instances currently in use.
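A minimal sketch of creating such a pool with the Databricks SDK for Python, with illustrative node type and sizing values:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Keep 2 warm instances idle for fast cluster start-up, cap the pool at 10 instances,
# and release idle instances after 30 minutes to control cost. Values are illustrative.
pool = w.instance_pools.create(
    instance_pool_name="etl-warm-pool",
    node_type_id="i3.xlarge",   # example node type
    min_idle_instances=2,
    max_capacity=10,
    idle_instance_autotermination_minutes=30,
)
print(f"Created instance pool {pool.instance_pool_id}")
```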
Some of the benefits of using pools include faster cluster start-up and autoscaling, since idle instances are already provisioned, and the ability to reuse warm instances across clusters.
Databricks Autoscaling is designed to maximise cluster efficiency by dynamically allocating resources in response to workload fluctuations. This setting is most suitable for workloads with varying compute requirements. Users provide a minimum and maximum number of worker nodes, and resources are allocated based on workload requirements. For example, a cluster might be configured with a minimum of 2 and a maximum of 8 worker nodes:
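Here is a hedged sketch of that configuration with the Databricks SDK for Python (runtime and node type are illustrative):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# An autoscaling cluster that grows from 2 to 8 workers as the workload demands.
autoscaling_cluster = w.clusters.create(
    cluster_name="autoscaling-etl-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),
    autotermination_minutes=60,
).result()

print(f"Autoscaling cluster {autoscaling_cluster.cluster_id} is running")
```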
For DLT pipelines, there is also an Enhanced Autoscaling mode available, which adds autoscaling support for streaming workloads along with additional enhancements to improve the performance of batch workloads.
Spot instances help reduce compute costs for non-mission-critical workloads. Using spot pricing for cloud resources, users can access unused capacity at deep discounts. Databricks automatically handles spot VM evictions by starting new pay-as-you-go (on-demand) worker nodes, helping jobs run reliably to completion. This provides predictability while helping to lower costs.
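As an AWS-flavoured sketch with the Databricks SDK for Python (Azure and GCP have analogous attribute blocks), the example below keeps the driver on an on-demand instance and lets workers fall back from spot to on-demand when spot capacity is unavailable; all values are illustrative.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Workers use spot capacity with fallback to on-demand; the first node (driver) stays on-demand.
spot_cluster = w.clusters.create(
    cluster_name="spot-batch-cluster",
    spark_version="15.4.x-scala2.12",   # example LTS runtime
    node_type_id="i3.xlarge",           # example node type
    num_workers=6,
    aws_attributes=compute.AwsAttributes(
        availability=compute.AwsAvailability.SPOT_WITH_FALLBACK,
        first_on_demand=1,              # keep the driver on an on-demand instance
    ),
    autotermination_minutes=60,
).result()

print(f"Spot-backed cluster {spot_cluster.cluster_id} is running")
```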
In addition to the previously mentioned configurations, users can specify a number of further options when defining compute.
Databricks Compute offers powerful and flexible solutions for handling complex data processing and analytics tasks. With the ability to efficiently and automatically scale, and different configuration options available for each type of workload, Databricks provides a robust framework for optimising computational resources. Whether you're dealing with large-scale data processing or dynamic workloads, Databricks Compute empowers data teams to derive meaningful insights and drive innovation in a data-driven landscape.