Databricks Community

alan_mazan · 01-18-2024

Environment isolation is a critical practice in software engineering, essential for application stability, security, and efficient development. Separating environments into CI/CD stages such as development, testing and production prevents development activities from disrupting live applications and avoids conflicts between differing software dependencies. This separation is crucial for a predictable and controlled development process, especially in complex software projects. Environment isolation also strengthens governance and security: it protects against unauthorized access and allows the “least privilege principle” to be applied to access control on assets such as data, code or execution control.

Because it is multifaceted, proper environment isolation is a complex topic for any software or data platform. This blog post explores the multiple dimensions of environment isolation on the Databricks Data Intelligence Platform and offers some best practices for customers to follow.

It is assumed that the reader is familiar with the basic concepts of Databricks like Workspaces, Tiering, Clusters, Notebooks, Databricks File System, Unity Catalog and Delta Sharing as well as cloud concepts such as virtual machines and cloud storage.

Overview

Environment isolation on Databricks happens along multiple dimensions. At the top level, environments are isolated by Databricks account, followed by “between-workspace isolation”, followed by “within-workspace isolation”.

Databricks Account Overview

Databricks Account Isolation

A Databricks account holds a collection of Databricks workspaces, the corresponding account-level configurations (e.g. feature enablement), one Unity Catalog metastore per cloud region, and a single IdP integration (e.g. SSO + SCIM via AAD). Unity Catalog manages all data access permissions (except for DBFS access) and is physically located on a cloud storage bucket.

Databricks Workspace (physical)

There are two fundamental differences between Azure and AWS:

- On Azure there is a 1:1 mapping between Databricks account and AAD tenant. On AWS the relation is more relaxed, i.e. one Databricks account can map onto multiple AWS accounts.
- On Azure there is no limit on how many Databricks workspaces can be created. On AWS there is a limit of 50 workspaces (for Enterprise tier).

Relation Account <> Workspace for AWS and Azure

Usually customers use one Databricks account per data platform. Reasons to use multiple accounts could be:

- Very strict isolation requirements for CI/CD stages (dev, staging, prod), usually in regulated industries.
- Hitting the AWS workspace limit when following a “one workspace per team” architecture (see next section).
- Multi-cloud architectures: running workspaces on multiple cloud providers requires at least one Databricks account per cloud provider.

Using multiple accounts has the following consequences:

- Account-level configurations can differ but must be maintained multiple times.
- Billing is handled separately.
- The Unity Catalog metastore within a cloud region is not shared between workspaces of different accounts. Therefore metadata and managed tables have to live in different storage locations. Sharing data between workspaces of different accounts must be managed via Delta Sharing, which carries a higher administrative overhead. Administrators can also prohibit the use of Delta Sharing. These restrictions can be desirable in some cases.

Although account-level isolation by nature provides strict data and metadata isolation, separate accounts are not required to achieve it. Within a single metastore, each catalog can be given its own physical storage location for managed tables.

Physical Catalog Isolation
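As a minimal sketch of per-catalog physical isolation, the helper below builds a `CREATE CATALOG ... MANAGED LOCATION` statement that could be run from a notebook or SQL editor. The catalog names and bucket URLs are hypothetical, and the sketch assumes the storage paths are already reachable by Unity Catalog.

```python
def create_catalog_sql(catalog: str, managed_location: str) -> str:
    """Build a CREATE CATALOG statement with a dedicated managed storage location."""
    # Reject characters that would break out of the quoting used below.
    if "`" in catalog:
        raise ValueError("catalog name must not contain a backtick")
    if "'" in managed_location:
        raise ValueError("managed location must not contain a single quote")
    return (
        f"CREATE CATALOG IF NOT EXISTS `{catalog}` "
        f"MANAGED LOCATION '{managed_location}'"
    )


# Hypothetical example: one catalog per CI/CD stage, each on its own bucket.
STAGE_LOCATIONS = {
    "sales_dev": "s3://example-sales-dev/managed",
    "sales_staging": "s3://example-sales-staging/managed",
    "sales_prod": "s3://example-sales-prod/managed",
}

STATEMENTS = [create_catalog_sql(name, url) for name, url in STAGE_LOCATIONS.items()]
```

Each generated statement places the managed tables of one catalog in its own bucket, so a storage-level policy on the prod bucket does not have to account for dev data.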

In addition, each catalog can be bound so that it is accessible from specific workspaces only.

Catalog <> Workspace Binding

Between Workspace Isolation

A Databricks workspace consists of:

- Various compute instances, i.e. all-purpose clusters, job clusters (attached to jobs), SQL warehouses and model serving endpoints.
- A Databricks File System.
- A Databricks Workspace File System.
- A set of assets created within the workspace such as code (notebooks, scripts), jobs, MLflow experiments, models, dashboards, files in DBFS, etc.
- A set of users & groups with access permission to the workspace.
- A set of workspace-level administrative configurations such as:
  - Unity Catalog being enabled and assigned to a metastore (or not).
  - User-specific configurations such as users having “unrestricted cluster creation” permissions or cluster policy configurations.
- A tier with respective features (Standard, Premium, Enterprise) and add-ons, e.g. Enhanced Security and Compliance.

Customers often initially think of workspaces as the layer of isolation that separates teams (per use case or BU) and CI/CD stages. This is a typical setup with native cloud architectures. Generally, customers can continue to follow this approach with Databricks.

Multi-Workspace Architecture

There are the following trade-offs to consider:

- Higher isolation between groups: Users cannot accidentally share assets with members of other groups who are not intended to have access.
- Higher data security: Similar to the previous point, data cannot be shared by accident with members of other groups. Within a single workspace this can happen, for instance, through a user:
  - Writing intermediary results to shared DBFS.
  - Being part of two groups, each group having its own catalog. By accident, the user could read data from one group's catalog and write it to the other group's catalog. Between-workspace isolation can prevent this if catalog-workspace binding is enabled.
- Harder to hit workspace limits: Workspaces and the underlying cloud infrastructure have maximum limits on some resources. For instance, a VNet has a maximum number of IP addresses, a cloud account has quotas on maximum concurrent CPUs, and Databricks workspaces have limits on the maximum number of jobs and of concurrently running jobs. Between-workspace group isolation (assuming that cloud infrastructure is not shared between workspaces) makes it harder to hit these limits. In practice, this only becomes a problem for very large data platforms.
- Lower collaboration between groups: A necessary consequence is that groups cannot easily collaborate by sharing assets (such as notebooks) with each other.
- Hard to manage and maintain: Manually creating a workspace for each new team and managing the configurations of all workspaces is impractical. Platform teams must employ a very high level of automation via infrastructure as code (IaC).
- Workspace limit per account on AWS: The number of workspaces can easily grow beyond the 50 workspace limit.
- Costs: Higher costs for additional networking resources.
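To make the IP address limit mentioned above concrete, here is a rough back-of-the-envelope sketch. It assumes an Azure workspace deployed into its own VNet with two equally sized subnets, that each cluster node consumes one address in each subnet, and that Azure reserves five addresses per subnet. These figures are assumptions to verify against the current cloud and Databricks documentation, not values taken from this post.

```python
import ipaddress

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def max_cluster_nodes(subnet_cidr: str, reserved: int = AZURE_RESERVED_PER_SUBNET) -> int:
    """Estimate how many cluster nodes fit into one workspace subnet.

    Assumes every node takes exactly one address in this subnet, so the
    estimate equals the number of usable addresses.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    return max(subnet.num_addresses - reserved, 0)


# Hypothetical subnet sizes for a small and a large workspace.
SMALL_WORKSPACE_NODES = max_cluster_nodes("10.0.0.0/26")
LARGE_WORKSPACE_NODES = max_cluster_nodes("10.1.0.0/18")
```

Under these assumptions a /26 subnet leaves room for 59 concurrent nodes across all teams in the workspace, which is why a shared workspace with small subnets reaches the limit sooner than several workspaces with their own networks.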

Within Workspace Isolation

Generally, users and groups can be separated within a Databricks workspace via ACLs and compute isolation mechanisms. The creator of an object is its owner and must grant other users access to it. If objects are hierarchical, their ACLs propagate down. There are different levels of access such as “Can Manage” or “Can View”. No access usually means that the object's existence is not visible at all. Access can be granted to users individually or based on group membership. Users and groups are usually managed in the customer's IdP and federated down to the Databricks account level. From the account level, they are distributed across workspaces.

SCIM Federation
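The downward propagation of ACLs described above can be illustrated with a toy model. This is not the Databricks permissions API: the folder paths and group names are hypothetical, and only the two access levels named in the text are modeled.

```python
from pathlib import PurePosixPath

# Only the two access levels named in the text, ordered weakest to strongest.
LEVEL_RANK = {"CAN_VIEW": 1, "CAN_MANAGE": 2}


def effective_permission(acls, principals, path):
    """Return the strongest level granted on `path` or any ancestor folder.

    `acls` maps a folder or object path to {principal: level}.
    `principals` is the set of the user's own name and group memberships.
    None means no access, i.e. the object is not visible to the user.
    """
    best = None
    node = PurePosixPath(path)
    # Check the object itself first, then every ancestor up to the root.
    for location in (node, *node.parents):
        for principal, level in acls.get(str(location), {}).items():
            if principal not in principals:
                continue
            if best is None or LEVEL_RANK[level] > LEVEL_RANK[best]:
                best = level
    return best


# Hypothetical workspace layout: the sales group manages its folder,
# finance may only view the reports subfolder.
EXAMPLE_ACLS = {
    "/teams/sales": {"sales": "CAN_MANAGE"},
    "/teams/sales/reports": {"finance": "CAN_VIEW"},
}
```

With this layout, a member of `sales` gets “Can Manage” on everything under `/teams/sales`, while a member of `finance` can view `/teams/sales/reports/q1` but gets no result at all for `/teams/sales/etl/load`.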

There are the following specific access control and isolation mechanisms:

- Databricks Workspace File System (DWFS) ACL: DWFS is a hierarchical namespace for some specific assets within a Databricks workspace: notebooks, scripts, DBSQL queries & dashboards, MLflow experiments. These assets are grouped in directories and each directory in the namespace is access controlled. There are assets (like jobs or DLT pipelines) that are not governed via DWFS.
- Asset level ACL: Each asset, irrespective of whether it is an object in DWFS or not, is also access controlled individually. For objects that live in DWFS (like notebooks), directory permissions are inherited down to the individual object.
- Unity Catalog exposes a three-level hierarchical namespace to access control data and ML models: <catalog>.<schema>.<table_or_volume_or_ucmodel>. Catalogs can only be created by metastore admins. Within a catalog, the respective object creation or object access rights must be granted to users.
- Compute level ACL & isolation: Users cannot create compute resources unless they have access to a policy or “Allow unrestricted cluster creation” has been granted. Admins can create policies and compute resources, and give usage permissions to users and groups. Compute resources can be defined as “shared” or “individual”. Different compute instances can run with different runtimes and dependencies. Within the scope of a Databricks notebook that is attached to a cluster, new dependencies can be installed or existing ones overridden. Changes to dependencies within the scope of a notebook do not propagate to other users, notebooks, or the shared cluster level.
- External Locations and Storage Credentials: To access cloud storage containers, External Locations and Storage Credentials must be configured and access to them granted via ACLs.
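For the Unity Catalog mechanism in the list above, read access to a single table requires privileges at each of the three namespace levels. The sketch below generates the corresponding `GRANT` statements. The catalog, schema, table and group names are hypothetical.

```python
def quote(identifier: str) -> str:
    """Wrap an identifier in backticks, rejecting embedded backticks."""
    if "`" in identifier:
        raise ValueError(f"invalid identifier: {identifier!r}")
    return f"`{identifier}`"


def read_grants(catalog: str, schema: str, table: str, principal: str) -> list:
    """Return the GRANT statements a principal needs to read one table."""
    c, s, t, p = quote(catalog), quote(schema), quote(table), quote(principal)
    return [
        # Level 1: allow the principal to use the catalog at all.
        f"GRANT USE CATALOG ON CATALOG {c} TO {p}",
        # Level 2: allow the principal to use the schema inside it.
        f"GRANT USE SCHEMA ON SCHEMA {c}.{s} TO {p}",
        # Level 3: allow reading the table itself.
        f"GRANT SELECT ON TABLE {c}.{s}.{t} TO {p}",
    ]


# Hypothetical example: analysts may read one prod table.
EXAMPLE_GRANTS = read_grants("sales_prod", "etl", "orders", "sales-analysts")
```

Granting at the narrowest level that still covers the use case keeps the setup close to the least privilege principle. A group that should read a whole schema would instead receive `SELECT` on the schema.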

Note the following exceptions:

- Databricks File System is neither access controlled nor isolated within a workspace. This can lead to security issues between groups. Access can be disabled on “Shared” clusters.
- Admins have full access and visibility into a Databricks workspace.

Best Practices

Unless there are good reasons in your particular case that speak against it, we recommend the following:

One Databricks account per data platform per customer. Between-workspace and within-workspace separation usually provide enough isolation guarantees, and this architecture is much simpler to build and maintain (especially on Azure).

Separating groups within a workspace.

- For each group there should be three catalogs (one for each CI/CD stage). Within a catalog, a group should organise governance for itself, i.e. every group member having equal rights or a group leader implementing more fine-grained access control.
- Cluster creation should be restricted using Databricks cluster policies. Policies should enforce tags for group-based cost attribution. There can also be additional “cross-group shared all-purpose clusters” and a shared SQL warehouse created by admins.
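A cluster policy that enforces a cost-attribution tag could look roughly like the definition generated below. The tag key `team`, the limits and the group name are hypothetical choices for illustration. The attribute paths and the `fixed` / `range` types follow the cluster policy definition format.

```python
import json


def group_cluster_policy(group: str, max_workers: int = 8, max_idle_minutes: int = 60) -> str:
    """Return a cluster policy definition (JSON) for one group."""
    definition = {
        # Every cluster created under this policy carries the group's tag,
        # so usage can be attributed to the group.
        "custom_tags.team": {"type": "fixed", "value": group},
        # Cap cluster size to keep per-group spend bounded.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Force idle clusters to shut down.
        "autotermination_minutes": {
            "type": "range",
            "maxValue": max_idle_minutes,
            "defaultValue": min(30, max_idle_minutes),
        },
    }
    return json.dumps(definition, indent=2)


# Hypothetical example for a group called "sales".
SALES_POLICY = group_cluster_policy("sales")
```

The resulting JSON can be passed as the policy definition when creating the policy through the UI, the API or Terraform. Users in the group are then given permission on the policy instead of unrestricted cluster creation.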

Separating CI/CD stages using different workspaces with their own isolated underlying cloud infrastructure (except for UC metastores).

Workspaces and catalogs should be created and maintained using infrastructure as code (IaC) principles and our Databricks (or AzureRM on Azure) Terraform provider. Optionally, there might be additional workspaces, e.g. for:
- Additional cloud regions.
- Highly sensitive workloads (e.g. regulated workloads, processing PII data) that require special configurations (e.g. enhanced security and compliance, Delta Sharing disabled) and special access control.
- Sandboxes for exploring functionality in a less restricted setting, not meant to host any production data or workloads.
- Migrations to new workspaces e.g. migrating to UC enabled workspaces.
- Backups.

Each group should have its own Git repo(s) to version its code. Each user should integrate the repo via Databricks Repos into their personal directory in the workspace for development, essentially creating their own clone. A branching strategy that isolates the CI/CD stages (such as git flow) should be employed.

Automated jobs or Delta Live Tables pipelines should be deployed and run as service principals and go through the different stages of a CI/CD process to be reviewed, tested, and deployed to the prod environment. For deployment, we recommend Databricks Asset Bundles (DABs). DABs create isolated deployments in a workspace by syncing code to a user's directory in DWFS and by setting object ownership on assets (jobs, DLT pipelines).

Stage-specific configuration can be employed:

- For the application code via config files.
- For the job / DLT definition via the DABs “targets” feature.
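For the application-code side, stage-specific configuration can be as simple as a set of defaults with per-stage overrides. The sketch below assumes a hypothetical environment variable `DEPLOY_STAGE` that the deployment sets for each stage, and hypothetical catalog names following the one-catalog-per-stage recommendation.

```python
import os

# Settings shared by all stages.
DEFAULTS = {"schema": "etl", "write_mode": "append"}

# Hypothetical per-stage overrides; in practice these could live in
# separate config files checked into the group's repo.
STAGE_OVERRIDES = {
    "dev": {"catalog": "sales_dev", "write_mode": "overwrite"},
    "staging": {"catalog": "sales_staging"},
    "prod": {"catalog": "sales_prod"},
}


def load_config(stage=None):
    """Merge defaults with the overrides of the requested stage."""
    # Fall back to the (hypothetical) DEPLOY_STAGE variable, then to dev.
    stage = stage or os.environ.get("DEPLOY_STAGE", "dev")
    if stage not in STAGE_OVERRIDES:
        raise ValueError(f"unknown stage: {stage!r}")
    return {**DEFAULTS, **STAGE_OVERRIDES[stage], "stage": stage}


def target_table(config, table):
    """Build the three-level Unity Catalog name for a table."""
    return f"{config['catalog']}.{config['schema']}.{table}"
```

The same job code then writes to `sales_dev.etl.orders` in dev and `sales_prod.etl.orders` in prod without any stage-specific branching in the business logic.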

Conclusion

Environment isolation is important in building data platforms. The Databricks Data Intelligence Platform is an enterprise-grade solution that offers multiple mechanisms for environment isolation. There are best practices that will work for many customers, but they are not the right fit for everyone. Therefore, it is important to understand the isolation mechanisms of the platform when designing an architecture.

You can explore more information in our Databricks Academy. You can also reach out to your Databricks account team or one of our Partners.

Isolation of Environments on the Databricks Data Intelligence Platform
