
By Hari Selvarajan & Sourav Gulati

Welcome to the third installment of our blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. Previously, we explained the basics of creating your first pipeline as well as best practices for configuring and monitoring it. In this blog, we'll focus on Databricks compute options and share guidelines for configuring your resources based on workload.

Databricks Compute

Whether a single VM or an Apache Spark cluster, Databricks Compute refers to the set of resources configured and managed by Databricks to execute a variety of tasks (e.g., Delta Live Tables, dbt, streaming, ETL, ML). There are three main configurations to choose from when defining compute resources for a specific workload: Compute Type, Access Mode, and Databricks Runtime (DBR).

HariSelvarajan_0-1707865066018.png

Compute types

All-Purpose Compute

All-Purpose Compute is best suited to development, collaboration, and interactive analysis. It can be created via the user interface (UI), command-line interface (CLI), or REST API. These clusters cater to collaborative tasks such as Exploratory Data Analysis (EDA) and pipeline development. They can be manually terminated and restarted as needed, and multiple users can share the same cluster. Although All-Purpose Compute supports scheduled Workflows, Databricks recommends using Job Compute for this purpose.

Job Compute

Databricks Job Compute is recommended for orchestrating production and repeated workloads, as it provides better resource isolation and cost benefits. The compute resources are dynamically created by the Workflow scheduler during Workflow execution and immediately terminated upon completion. Unlike All-Purpose Compute, users cannot manually restart a Job Compute resource. 
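As a rough illustration of how a job cluster is declared, the sketch below builds a Jobs API-style payload in which the task carries its own `new_cluster` block, so the cluster exists only for the duration of the run. The job name, notebook path, node type, and runtime version string are placeholder assumptions, not values from this post, and the payload is only printed, not submitted.

```python
import json


def build_job_payload(job_name, notebook_path, spark_version, node_type_id, num_workers):
    """Return a Jobs API-style payload whose single task runs on an ephemeral job cluster."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # `new_cluster` asks the scheduler to create a cluster for this run
                # and terminate it once the run completes.
                "new_cluster": {
                    "spark_version": spark_version,
                    "node_type_id": node_type_id,
                    "num_workers": num_workers,
                },
            }
        ],
    }


payload = build_job_payload(
    job_name="nightly-etl",                   # hypothetical job name
    notebook_path="/Workspace/etl/nightly",   # hypothetical notebook path
    spark_version="13.3.x-scala2.12",         # assumed runtime version key
    node_type_id="i3.xlarge",                 # assumed AWS node type
    num_workers=2,
)
print(json.dumps(payload, indent=2))
```

A task that should run on an All-Purpose cluster instead would reference it with `existing_cluster_id` rather than `new_cluster`.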

SQL Warehouses 

Databricks SQL (DBSQL) Warehouses run on a purpose-built engine, finely tuned for optimal SQL analytics performance. They include key performance features such as Photon (explained in detail in a later section), Predictive I/O, and Intelligent Workload Management (IWM). Exclusive to DBSQL Serverless, IWM uses AI-powered prediction to allocate resources dynamically and efficiently.

When creating SQL tasks (queries, alerts, dashboards, or SQL files) in Workflows, select a SQL warehouse as your compute. The warehouse can be one of the following types:

  • Classic Warehouse:
    • Limited Databricks SQL functionality
    • Basic performance features

  • Pro Warehouse:
    • Supports all Databricks SQL functionality
    • Higher performance compared to Classic
    • Introduces features like query federation, workflow integration, and advanced support for data science and ML functions

  • Serverless Warehouse:
    • Advanced performance features (such as Intelligent workload management)
    • Supports all Pro-type features
    • Offers instant and fully managed compute resources

You can read more about DBSQL here.
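To make the "select a SQL warehouse as your compute" step concrete, here is a minimal sketch of a Workflows SQL task expressed as a Jobs API-style dictionary. The task points at a warehouse through `warehouse_id` instead of a cluster definition. The warehouse ID and query ID below are made-up placeholders.

```python
import json


def build_sql_task(task_key, warehouse_id, query_id):
    """Return a Jobs API-style task that runs a saved query on a SQL warehouse."""
    return {
        "task_key": task_key,
        "sql_task": {
            # SQL tasks reference a warehouse rather than a job or all-purpose cluster.
            "warehouse_id": warehouse_id,
            "query": {"query_id": query_id},
        },
    }


sql_task = build_sql_task(
    task_key="refresh_daily_report",
    warehouse_id="1234567890abcdef",                    # hypothetical warehouse ID
    query_id="00000000-0000-0000-0000-000000000000",    # hypothetical query ID
)
print(json.dumps(sql_task, indent=2))
```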

DLT Compute

Delta Live Tables (DLT) pipelines run on dedicated clusters optimized for the requirements of DLT workloads. These are distinct from the more general-purpose All-Purpose or Job Compute. While creating a DLT pipeline, users can define the Cluster Scaling Mode (Fixed, Legacy, or Enhanced), which controls how the cluster scales in response to changes in demand. DLT manages the node type and DBR selection, choosing suitable nodes and the latest DBR version, which reduces the management overhead for users.

Here's a brief side-by-side analysis of these four compute types:

| Type | All-Purpose Cluster | Jobs Cluster | SQL Warehouse | DLT Compute |
| --- | --- | --- | --- | --- |
| Persistence | Persistent cluster; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the job; terminated on completion | Persistent cluster; terminates after the defined inactivity threshold and can be restarted when needed | Ephemeral cluster created for the job; terminated on completion |
| Workload | Interactive data analytics and EDA | Run Data Engineering, Data Science and BI workloads | Execute SQL queries, alerts and dashboards in interactive and scheduled mode; run dbt models | Streaming workloads |
| Use | Development and ad-hoc analysis | Production and repeated workloads | Interactive and production SQL workloads | Production and repeated workloads |
| Benefits | Collaboration with team members; ability to restart the cluster | Workload isolation and orchestrated runs | Suitable for BI analytics workloads | Out-of-the-box features such as quality metrics, event logs, automatic restarts, autoscaling, etc. |
| Cost | Pay for usage time | Pay for usage time. Note: job clusters cost less (approx. 50%) than all-purpose compute for the same amount of run time | Pay for usage; DBU cost differs based on the type of warehouse | Pay for usage; DBU cost differs based on product edition |
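The cost note for job clusters can be illustrated with simple arithmetic: DBU cost is the DBU consumption rate multiplied by run time and by the price per DBU. The prices below are hypothetical round numbers chosen only to mirror the "approx. 50%" difference mentioned above; actual DBU prices depend on cloud, region, and plan, and cloud VM charges are billed separately.

```python
def dbu_cost(dbu_per_hour, hours, price_per_dbu):
    """DBU cost of a run: consumption rate x duration x unit price (excludes cloud VM cost)."""
    return dbu_per_hour * hours * price_per_dbu


# Hypothetical unit prices, for illustration only.
ALL_PURPOSE_PRICE_PER_DBU = 0.50
JOBS_PRICE_PER_DBU = 0.25

# Same cluster size (8 DBU/hour, an assumed figure) and same 3-hour run on both compute types.
all_purpose_cost = dbu_cost(dbu_per_hour=8, hours=3, price_per_dbu=ALL_PURPOSE_PRICE_PER_DBU)
jobs_cost = dbu_cost(dbu_per_hour=8, hours=3, price_per_dbu=JOBS_PRICE_PER_DBU)
savings = 1 - jobs_cost / all_purpose_cost

print(f"all-purpose: ${all_purpose_cost:.2f}, jobs: ${jobs_cost:.2f}, savings: {savings:.0%}")
# prints: all-purpose: $12.00, jobs: $6.00, savings: 50%
```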



Serverless Compute

Serverless Compute is fully managed by Databricks enabling rapid start-up times and automatic optimisations that adapt to your specific workloads. This means Serverless will process your data in a manner that is both cost and performance efficient. These benefits translate to a lower TCO, better reliability, and an improved user experience.

Serverless compute is an option currently available under:

  • Databricks SQL (GA)
  • DLT (Private Preview)
  • Serverless Workflows (Private Preview)
  • Serverless Interactive / Notebooks (Private Preview)

In the case of Serverless Workflows and DLT, Databricks chooses the compute configuration (runtime, node type, and size) best suited to the workload. Please reach out to your Databricks account team if you would like to use this feature.

Access modes

Compute access modes define the permissions and restrictions for cluster usage and data access. The access mode also determines whether a cluster can use governance features such as Unity Catalog. Databricks clusters offer the following access modes:

Single user

The Single User access mode is used to run a Workflow under the ownership of a single user. When a Workflow is executed on a Single User Access Mode cluster, it is executed under the identity of the assigned user/service principal. For production jobs, it is recommended to run the job as a service principal.

Some recommended use cases for the Single User Access Mode cluster:

  1. When credential passthrough is needed
  2. Cost isolation based on users
  3. Execution of R workloads on Unity Catalog
  4. Using ML Runtime

Shared

The Shared access mode enables multiple users to use the same compute concurrently, fostering a collaborative environment. While users run their workloads concurrently, the Shared access mode preserves user isolation and improves security.

Whilst Databricks recommends using Shared access mode for most workloads, there are some exceptions: the Databricks Runtime for ML and Spark Machine Learning Library (MLlib) are not yet supported.

For a complete list of limitations of Single-user and Shared access modes, refer to the official documentation.

No isolation shared

In this mode, there is no support for Unity Catalog. While it allows multi-user environments, it doesn’t provide user isolation and is not recommended for working on sensitive data.
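When clusters are defined through the API rather than the UI, the access mode is expressed as a `data_security_mode` value. The sketch below maps the three access mode names used in this post to the values we understand the Clusters API to accept; treat the mapping as an assumption to verify against the API reference for your workspace. The user name is a placeholder.

```python
# Assumed mapping from UI access mode names to Clusters API `data_security_mode` values.
ACCESS_MODE_TO_API = {
    "single user": "SINGLE_USER",
    "shared": "USER_ISOLATION",
    "no isolation shared": "NONE",
}


def access_mode_settings(access_mode, single_user_name=None):
    """Return the cluster-spec fragment for the given access mode."""
    mode = ACCESS_MODE_TO_API[access_mode.lower()]
    settings = {"data_security_mode": mode}
    if mode == "SINGLE_USER":
        # A single user cluster must be assigned to one user or service principal.
        if not single_user_name:
            raise ValueError("single user access mode needs an assigned user or service principal")
        settings["single_user_name"] = single_user_name
    return settings


print(access_mode_settings("Single User", single_user_name="etl-bot@example.com"))  # hypothetical principal
print(access_mode_settings("Shared"))
print(access_mode_settings("No Isolation Shared"))
```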

Here's a brief side-by-side analysis of access modes:

| Access Mode | Single User | Shared | No Isolation Shared |
| --- | --- | --- | --- |
| Permission required to use | Only the assigned user can use | CAN ATTACH | CAN ATTACH |
| Unity Catalog Support | | | |
| Access data in Unity Catalog | Yes | Yes | No |
| Fine-grained access control (views, row and column masking) | Will be available soon | Yes | No |
| Language and APIs | | | |
| Python/SQL | Yes | Yes | Yes |
| Scala | Yes | In Preview (at the time of writing) | Yes |
| DataFrame API, Streaming API, Single-Node ML | Yes | In Preview (at the time of writing) | Yes |
| R, Distributed ML, GPU, RDD API | Yes | No | Yes |
| Other Features | | | |
| Init Scripts / Cluster Libraries | Yes | In Preview (at the time of writing) | Yes |
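The comparison above can also be read as a lookup: given the features a workload needs, which access modes qualify? The following sketch encodes the table as it stands at the time of writing and filters on it. The feature keys and the `candidate_modes` helper are our own illustrative names, not a Databricks API.

```python
MODES = ["single_user", "shared", "no_isolation_shared"]

# Support levels transcribed from the table above: "yes", "preview", "soon", or "no".
SUPPORT = {
    "unity_catalog":                  {"single_user": "yes",  "shared": "yes",     "no_isolation_shared": "no"},
    "fine_grained_access_control":    {"single_user": "soon", "shared": "yes",     "no_isolation_shared": "no"},
    "python_sql":                     {"single_user": "yes",  "shared": "yes",     "no_isolation_shared": "yes"},
    "scala":                          {"single_user": "yes",  "shared": "preview", "no_isolation_shared": "yes"},
    "dataframe_streaming_single_ml":  {"single_user": "yes",  "shared": "preview", "no_isolation_shared": "yes"},
    "r_distributed_ml_gpu_rdd":       {"single_user": "yes",  "shared": "no",      "no_isolation_shared": "yes"},
    "init_scripts_cluster_libraries": {"single_user": "yes",  "shared": "preview", "no_isolation_shared": "yes"},
}


def candidate_modes(required_features, allow_preview=False):
    """Return the access modes that support every required feature."""
    accepted = {"yes", "preview"} if allow_preview else {"yes"}
    return [
        mode for mode in MODES
        if all(SUPPORT[feature][mode] in accepted for feature in required_features)
    ]


print(candidate_modes(["unity_catalog", "r_distributed_ml_gpu_rdd"]))
# prints: ['single_user']
print(candidate_modes(["unity_catalog", "scala"], allow_preview=True))
# prints: ['single_user', 'shared']
```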

Databricks Runtime (DBR)

Databricks Runtime encompasses the collection of software components deployed on Databricks Compute. Along with Apache Spark™, it includes various libraries (e.g., Delta, MLflow) and components that substantially improve the usability, performance, and security of running workloads. Depending on the type of workload, different Databricks runtimes are available. The general recommendation is to always use the latest LTS (Long Term Support) version of each runtime, as it comes with the latest improvements and 3 years of support.

Standard

The Standard Runtime is ideal for diverse data engineering and streaming workloads, featuring Apache Spark and key components such as Delta and pandas. It also comes pre-installed with Python, R, and Scala libraries, providing a comprehensive environment for efficient data processing and analytics.

Machine Learning

In addition to the libraries included in the standard runtime, the ML Runtime adds popular machine learning libraries like MLflow, TensorFlow, Keras, PyTorch, and XGBoost. There is also an additional option to choose GPU-enabled runtime specifically for deep learning and generative AI workloads.

Photon

Photon is a specialised Databricks runtime option in which the Spark execution engine is rewritten from the ground up in C++. It provides significant performance improvements over standard Spark engines (up to 80% TCO savings over traditional DBR and up to 85% reduction in VM compute hours). It is available only under the Standard Databricks Runtime and is particularly beneficial for batch-processing workloads. Photon-enabled compute consumes DBUs at a higher rate. Photon is enabled by default on all DBSQL, DLT and Serverless compute.
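In a cluster specification, the runtime choice typically shows up as a `spark_version` key plus a `runtime_engine` setting. The helper below is a sketch under assumptions: the key formats follow the naming pattern we have seen for runtime versions (for example `13.3.x-scala2.12` and `13.3.x-cpu-ml-scala2.12`), and the Photon restriction simply mirrors the statement above. Check the runtime versions actually listed in your workspace before relying on these strings.

```python
def runtime_settings(version, ml=False, gpu=False, photon=False, scala="2.12"):
    """Build the runtime-related fields of a cluster spec (assumed key formats)."""
    if gpu and not ml:
        raise ValueError("the GPU runtime is an option of the ML runtime")
    if photon and ml:
        # Mirrors the text above: Photon is available only under the Standard runtime.
        raise ValueError("Photon cannot be combined with the ML runtime in this sketch")
    if ml:
        flavour = "gpu" if gpu else "cpu"
        key = f"{version}.x-{flavour}-ml-scala{scala}"
    else:
        key = f"{version}.x-scala{scala}"
    return {
        "spark_version": key,
        "runtime_engine": "PHOTON" if photon else "STANDARD",
    }


print(runtime_settings("13.3"))
print(runtime_settings("13.3", photon=True))
print(runtime_settings("13.3", ml=True, gpu=True))
```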

Additional Compute Configurations

Beyond what has been discussed so far, users have additional options to optimise and configure compute resources.

Cluster Policies

Cluster Policies can enforce or suggest configurations when creating a cluster to help admins achieve:

  • Control costs and allocate them to business units (BUs)
  • Enforce security and compliance
  • Ensure stability and standardization of clusters

As there are many options to choose from while creating a cluster, admins can define policies to limit user choices while creating clusters. The policies can cover DBR selection, nodes selection, max worker count, specific Spark properties (e.g. external hive metastore), and more.

Here’s an example of a cluster policy:

HariSelvarajan_1-1707865066052.png

HariSelvarajan_2-1707865065982.png

You can also refer to this blog for more detail on how cluster policies can manage your cost of compute. 
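For readers who prefer text to screenshots, the sketch below shows what a policy definition can look like and how its rules constrain a cluster spec. The `fixed`, `allowlist`, and `range` rule types and the dotted `custom_tags.<name>` path follow the policy definition format; the specific values and the local `violations` checker are illustrative assumptions. Databricks enforces policies server-side, and this simplified check is not how that enforcement is implemented.

```python
# Hypothetical policy: limit runtimes, cap cluster size, fix auto-termination and a cost tag.
POLICY = {
    "spark_version": {"type": "allowlist", "values": ["13.3.x-scala2.12", "14.3.x-scala2.12"]},
    "num_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "custom_tags.business_unit": {"type": "fixed", "value": "finance"},
}


def get_path(spec, dotted_path):
    """Look up a dotted attribute path such as 'custom_tags.business_unit' in a nested dict."""
    value = spec
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def violations(spec, policy):
    """Return human-readable rule violations (simplified: missing values count as violations)."""
    problems = []
    for attribute, rule in policy.items():
        value = get_path(spec, attribute)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{attribute} must be {rule['value']!r}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{attribute} must be one of {rule['values']}")
        elif rule["type"] == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            if value is None or not (low <= value <= high):
                problems.append(f"{attribute} must be between {low} and {high}")
    return problems


compliant = {
    "spark_version": "13.3.x-scala2.12",
    "num_workers": 4,
    "autotermination_minutes": 30,
    "custom_tags": {"business_unit": "finance"},
}
oversized = dict(compliant, num_workers=20)

print(violations(compliant, POLICY))
print(violations(oversized, POLICY))
```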

Compute pools

A compute pool is a collection of idle, ready-to-use instances that reduces cluster start-up and autoscaling delays. When a cluster is attached to a pool, it draws its instances from the pool at start-up; when the cluster terminates, the instances are returned to the pool for reuse. While pools are beneficial for running workloads on classic compute, Databricks recommends using Serverless where possible.

The following example illustrates key metrics for a pool, including Total Instances, Min Idle, Max Capacity, and Total Used:

HariSelvarajan_3-1707865065952.png

Some of the benefits of using pools include:

  • Less instance acquisition time because VMs are pre-configured with DBR
  • Short jobs no longer spend more time starting than running
  • Databricks does not charge DBUs for idle instances (cloud provider costs still apply)
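A pool and a cluster that draws from it might be described as follows. This is a sketch using Instance Pools and Clusters API-style field names; the pool name, node type, sizes, and pool ID are placeholder assumptions, and nothing is sent to a workspace.

```python
import json


def build_pool(name, node_type_id, min_idle, max_capacity, idle_autotermination_minutes, preloaded_runtime):
    """Return an Instance Pools API-style payload describing a pool of idle instances."""
    if min_idle > max_capacity:
        raise ValueError("min_idle cannot exceed max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,          # instances kept warm even when unused
        "max_capacity": max_capacity,            # upper bound on idle + in-use instances
        "idle_instance_autotermination_minutes": idle_autotermination_minutes,
        # Preloading a runtime is what lets pooled instances skip part of the start-up work.
        "preloaded_spark_versions": [preloaded_runtime],
    }


def cluster_from_pool(instance_pool_id, spark_version, num_workers):
    """Return a cluster spec that takes its instances from a pool instead of naming a node type."""
    return {
        "spark_version": spark_version,
        "instance_pool_id": instance_pool_id,
        "num_workers": num_workers,
    }


pool = build_pool("etl-pool", "i3.xlarge", 2, 10, 15, "13.3.x-scala2.12")   # hypothetical values
cluster = cluster_from_pool("pool-0123456789", "13.3.x-scala2.12", 4)       # hypothetical pool ID
print(json.dumps({"pool": pool, "cluster": cluster}, indent=2))
```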

Autoscaling

Databricks Autoscaling is designed to maximise cluster efficiency by dynamically allocating resources in response to workload fluctuations. It is best suited to workloads whose compute requirements vary over time. Users provide a minimum and maximum number of nodes, and resources are allocated within that range based on workload requirements. In the screenshot below, a cluster is configured with a minimum of 2 and a maximum of 8 nodes:

HariSelvarajan_4-1707865065996.png

For DLT pipelines, there is also an Enhanced Autoscaling mode, which adds autoscaling support for streaming workloads and further enhancements that improve the performance of batch workloads.
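Expressed as configuration, both forms of autoscaling come down to a worker range, with DLT adding a scaling mode. The sketch below builds the two fragments side by side, using the 2 to 8 node range from the example above. The field names follow the Clusters API and DLT pipeline settings as we understand them; the helper functions themselves are illustrative.

```python
def classic_autoscale(min_workers, max_workers):
    """Cluster-spec fragment for autoscaling on All-Purpose or Job Compute."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    return {"autoscale": {"min_workers": min_workers, "max_workers": max_workers}}


def dlt_autoscale(min_workers, max_workers, mode="ENHANCED"):
    """DLT pipeline-settings fragment; `mode` selects Enhanced or Legacy autoscaling."""
    if mode not in ("ENHANCED", "LEGACY"):
        raise ValueError("mode must be 'ENHANCED' or 'LEGACY'")
    fragment = classic_autoscale(min_workers, max_workers)
    fragment["autoscale"]["mode"] = mode
    return {"clusters": [{"label": "default", **fragment}]}


print(classic_autoscale(2, 8))
print(dlt_autoscale(2, 8))
```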

Spot Instances

Spot instances help reduce compute costs for non-mission-critical workloads. With spot pricing, users can access unused cloud capacity at deep discounts. When the cloud provider reclaims spot VMs, Databricks automatically starts new pay-as-you-go worker nodes to replace them so that the job can still run to completion. This provides predictability while helping to lower costs.
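On AWS, the spot-with-fallback behaviour described above is requested through the `aws_attributes` block of a cluster spec. The sketch below is AWS-specific and the values are illustrative assumptions; Azure and GCP use their own attribute blocks with different field names.

```python
def spot_with_fallback(first_on_demand=1, spot_bid_price_percent=100):
    """AWS cluster-spec fragment: spot workers, falling back to on-demand if spot is unavailable."""
    if first_on_demand < 0:
        raise ValueError("first_on_demand cannot be negative")
    return {
        "aws_attributes": {
            # The first N nodes (the driver comes first) are placed on on-demand instances.
            "first_on_demand": first_on_demand,
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": spot_bid_price_percent,
        }
    }


cluster_spec = {
    "spark_version": "13.3.x-scala2.12",   # assumed runtime version key
    "node_type_id": "i3.xlarge",           # assumed AWS node type
    "num_workers": 4,
    **spot_with_fallback(first_on_demand=1),
}
print(cluster_spec)
```

Keeping the driver on an on-demand instance is a common choice, since losing the driver ends the run while losing a worker does not.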

Other Configs

In addition to the previously mentioned configurations, users can also specify the following:

  • Spark Config Properties
    Users can fine-tune Apache Spark settings, adjusting parameters like memory allocation and parallelism for optimal data processing.
  • Assign Tags to the Clusters
    Users can categorize and manage resources efficiently by assigning tags to clusters, providing valuable metadata for organization and tracking.
  • Init Scripts for Cluster Initialization
    Users can provide custom initialization scripts that run at cluster startup to automate tasks such as dependency installation and environment setup.
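These three options map onto the `spark_conf`, `custom_tags`, and `init_scripts` fields of a cluster spec. A minimal sketch follows; the setting values, tag names, and script path are placeholders chosen for illustration.

```python
import json


def with_extras(base_spec, spark_conf=None, custom_tags=None, init_script_paths=()):
    """Return a copy of a cluster spec with Spark settings, tags, and init scripts added."""
    spec = dict(base_spec)
    if spark_conf:
        spec["spark_conf"] = dict(spark_conf)
    if custom_tags:
        spec["custom_tags"] = dict(custom_tags)
    if init_script_paths:
        # Each entry points at a script stored as a workspace file.
        spec["init_scripts"] = [{"workspace": {"destination": path}} for path in init_script_paths]
    return spec


base = {"spark_version": "13.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2}  # assumed values
spec = with_extras(
    base,
    spark_conf={"spark.sql.shuffle.partitions": "200"},
    custom_tags={"team": "data-eng", "cost_center": "1234"},      # hypothetical tags
    init_script_paths=["/Shared/init/install-deps.sh"],           # hypothetical script path
)
print(json.dumps(spec, indent=2))
```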

Conclusion

Databricks Compute offers powerful and flexible solutions for handling complex data processing and analytics tasks. With the ability to efficiently and automatically scale, and different configuration options available for each type of workload, Databricks provides a robust framework for optimising computational resources. Whether you're dealing with large-scale data processing or dynamic workloads, Databricks Compute empowers data teams to derive meaningful insights and drive innovation in a data-driven landscape.