Data security has always been a top priority for organizations, yet it is often one of the final conversations when standing up an environment or evaluating a new platform. With the number of cyber threats ever increasing, security should be at the forefront of infrastructure planning, and keeping your cloud-based data platforms secure is essential. One of the fundamental frameworks for securing such environments is Zero Trust Security. This security model assumes that internal and external networks are equally untrusted and that all access requests must be verified, regardless of their origin.

In this guide, we’ll explore how to implement Zero Trust principles within Databricks on Azure, focusing on critical components like multi-factor authentication (MFA), conditional access, fine-grained access controls, and the enforcement of least-privilege access. 

Today, organizations need a new security model that effectively adapts to the complexity of the modern environment, embraces the mobile workforce, and protects people, devices, applications, and data wherever they are located.

Understanding Zero Trust Principles

What is Zero Trust?

Zero Trust is a security model rooted in the principle of “never trust, always verify”. It assumes breach and verifies each request as though it originated from an uncontrolled network. Trust should never be implicit, and all access to systems, networks, and data must be continuously verified to ensure the right people are accessing the right levels of information at the right time.

Rather than assuming that users within a trusted network are safe, Zero Trust demands constant validation of all users and devices attempting to access resources, whether they are inside or outside the network perimeter.

Key Principles of Zero Trust

The core principles of Zero Trust revolve around the following:

  1. Verify Explicitly: Always and continuously authenticate and authorize users and devices before granting access to any resource.
  2. Least-Privilege Access: Provide users with the minimal level of access they need to perform their tasks. Limit user access with Just-in-Time and Just-Enough-Access, risk-based adaptive policies, and data protection.
  3. Assume Breach: Always operate under the assumption that a breach has already occurred. Implement security measures accordingly, including monitoring and rapid detection.
  4. Segmentation: Divide networks and resources into smaller, isolated segments to reduce the impact of potential breaches.

By implementing these principles, organizations can ensure that only authorized users and devices can access sensitive data and resources, and any abnormal behavior is detected and flagged in real time.

Applying Zero Trust to Databricks in Azure Environments

As a cloud-first platform, Azure provides a robust set of tools and features that can help enforce Zero Trust policies effectively. Databricks, as a first-party product in Azure, can natively integrate with these services to provide comprehensive data protection by applying Zero Trust principles.

In this section, we will look at how to apply these principles and introduce the Databricks security controls that support them. For more information about Databricks’ comprehensive approach to security, check out the Security and Trust Center.

Verify Explicitly

One of the foundational elements of Zero Trust is verifying the identity of users and devices. Microsoft Entra ID provides a suite of tools to implement multi-factor authentication (MFA) and conditional access policies, which are essential for securing access to Databricks. Because Databricks sits within Azure’s ecosystem, end users can access it through single sign-on backed by Entra ID.

  • Single sign-on using Microsoft Entra ID
    • Azure Databricks accounts and workspaces leverage single sign-on in the form of Microsoft Entra ID-backed login by default.
  • Multi-Factor Authentication (MFA)
    • MFA requires users to authenticate using more than just a password. It adds a second layer of security, usually a time-based one-time passcode (TOTP), a phone call, or a push notification from an authentication app. By enforcing MFA, you can ensure that even if a user’s password is compromised, the attacker will still need a second factor to gain access.
  • Implementing Conditional Access Policies
    • Conditional access is the next step in securing Databricks within Azure. With conditional access, you can enforce specific access policies based on various conditions, such as user location, device compliance, and risk levels, essentially allowing administrators to control where and when users are permitted to sign in to Databricks.
    • In Azure, you create the access policy and define the conditions, such as geographic location, device compliance, or application risk levels. Lastly, you apply specific controls, such as requiring MFA, blocking access, or granting access, based on those conditions and compliance with the previously defined security standards.
  • Sync users and groups from Microsoft Entra ID
    • Users and groups can automatically be synced from Microsoft Entra ID to your Azure Databricks account using System for Cross-domain Identity Management (SCIM). SCIM is an open standard that allows automated user provisioning and enables a consistent onboarding and offboarding process.
    • SCIM provisioning built on Entra ID creates users and groups in Azure Databricks and gives them the proper level of access. When a user leaves your organization or no longer needs access to Azure Databricks, admins can remove the user from Microsoft Entra ID, and that user is deactivated in Azure Databricks. This prevents unauthorized users from accessing sensitive data.
  • Secure API authentication with OAuth
    • Databricks OAuth supports secure credentials and access for resources and operations at the Azure Databricks workspace level, along with fine-grained permissions for authorization. Databricks also supports personal access tokens (PATs).

By combining MFA with conditional access, all backed by Microsoft Entra ID, you add another layer of security that aligns with the verify-explicitly principle of Zero Trust.

Least-Privilege Access

Once users have been authenticated, the next step in a Zero Trust approach is ensuring that they only have access to the resources they absolutely need. In Databricks, this can be achieved through fine-grained access control over various components like workspaces, clusters, and notebooks. In Azure Databricks, there are different access control systems for different securable objects. 

Databricks Access Control Model

Configuring Workspace Permissions

Databricks provides role-based access control (RBAC) and fine-grained permissions for managing access to resources. By configuring these controls, you can implement Zero Trust by ensuring that only the appropriate individuals have access to critical resources. At the workspace level, assign roles such as "Admin," "User," and "Viewer". Then grant specific permissions based on tasks, such as the ability to create notebooks, run jobs, or manage clusters.

Configuring Cluster Permissions

Cluster permissions in Databricks can be configured to ensure that only authorized users can spin up clusters or interact with them. This is particularly important in a Zero Trust environment to avoid the misuse of cloud resources. Define which users or groups can create, manage, or modify clusters through cluster access control lists (ACLs), and use cluster policies to ensure that clusters run with only the minimum privileges and configurations needed for tasks like data processing.

Configuring Notebook Permissions

Notebooks often contain sensitive information, such as queries, data processing code, or even credentials. Fine-grained notebook permissions help restrict access. Assign permissions at the notebook level, ensuring that only authorized users can read or modify sensitive notebooks. Databricks provides audit logs that track who accessed a notebook and when, helping maintain transparency and track any potential malicious activity.
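Notebook access events also flow into the audit system table covered later in this post. A minimal sketch follows; the email address is a placeholder, and the exact action_name values recorded for notebook events can vary, so inspect your own logs first:

    -- Recent notebook-related audit events for one user
    SELECT event_time, action_name, request_params
    FROM system.access.audit
    WHERE service_name = 'notebook'
      AND user_identity.email = 'user@example.com'  -- placeholder user
      AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC;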

Overview:

Securable Object                     Access Control System
-----------------------------------  -----------------------------------------
Workspace-level securable objects    Access control lists (ACLs)
Account-level securable objects      Account role-based access control (RBAC)
Data securable objects               Unity Catalog

By carefully configuring these access controls, you can create a secure environment where users only have access to the resources they need, significantly reducing the risk of unauthorized access.
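For data securable objects, least privilege maps directly onto Unity Catalog GRANT statements. A minimal sketch, assuming a hypothetical sales catalog, reporting schema, and analysts group:

    -- Start from zero: remove any broad grants made earlier
    REVOKE ALL PRIVILEGES ON CATALOG sales FROM `analysts`;

    -- Grant only what the group needs: reach the catalog and schema,
    -- and read its tables, with no create or modify privileges
    GRANT USE CATALOG ON CATALOG sales TO `analysts`;
    GRANT USE SCHEMA ON SCHEMA sales.reporting TO `analysts`;
    GRANT SELECT ON SCHEMA sales.reporting TO `analysts`;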

For an overview of setting up and managing user permissions, go to Manage Users on Azure Databricks.

Securing Data via Segmentation

Segmenting levels of access via Microsoft Entra ID at the account, subscription, resource group, and resource levels for Databricks compute and data storage is a great first step. For making use of the data inside Databricks, we rely on Unity Catalog as the unified governance solution for data and AI assets, enabling secure access and sharing across clouds and platforms, while providing a central place to manage permissions and audit data access across multiple workspaces.

The Databricks account is linked to the Microsoft Entra tenant ID. Any subscription inside the tenant is mapped to the same Databricks account. Consequently, any workspace in any subscription and resource group will be able to access the same metastore in the respective regions if admins allow this.

In Azure Databricks, a managed identity is used as a storage credential to authenticate with Unity Catalog when accessing ADLS Gen2. Administrators map storage credentials to specific storage accounts (like ADLS Gen2) using external locations, allowing secure access and easier management.

This simplifies administration by enabling users to manage access and data without needing deep cloud-specific knowledge. 
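In SQL, that mapping looks roughly like the sketch below; the external location, storage account, credential, and group names are all hypothetical:

    -- Map a managed identity (registered by an admin as a storage
    -- credential) to a specific ADLS Gen2 path via an external location
    CREATE EXTERNAL LOCATION IF NOT EXISTS finance_landing
      URL 'abfss://landing@financeadls.dfs.core.windows.net/'
      WITH (STORAGE CREDENTIAL finance_mi_credential);

    -- Least privilege on the path itself: readers can list and read
    -- files but cannot write or create tables against the location
    GRANT READ FILES ON EXTERNAL LOCATION finance_landing TO `data-engineers`;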

Unity Catalog further segments protection of data assets through a three-level namespace that can be used to represent a logical layout of how data is accessible and managed in your organization. The three levels are catalogs, schemas, and assets.

Catalogs represent the top level of the structure. They can be made available to select workspaces, which allows different operating units to enforce that their data is only available within their environments. This also allows teams to enforce the availability of data across software development lifecycle scopes (e.g., prod data in prod, dev data in dev).

Schemas are defined inside catalogs and can serve as a grouping for per-domain data assets. Assets live inside schemas (a short SQL sketch of this layout follows the list) and include the following objects:

  • Data assets: Tables, views, materialized views, streaming tables 
  • Machine learning models 
  • Unity Catalog Volumes: These are logically assigned to a specific catalog and schema and represent folders with files (e.g., images or documents)
  • Functions: Reusable procedures that can be written in SQL and Python 
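The namespace itself is laid out in SQL. A minimal sketch, assuming a hypothetical prod catalog and a per-domain schema:

    -- One catalog per environment keeps prod data in prod;
    -- schemas group assets per domain inside the catalog
    CREATE CATALOG IF NOT EXISTS prod;
    CREATE SCHEMA IF NOT EXISTS prod.customer_domain;

    -- Assets sit at the third level of the namespace
    CREATE TABLE IF NOT EXISTS prod.customer_domain.orders (
      order_id BIGINT,
      amount DECIMAL(10, 2)
    );

    -- Fully qualified three-level reference: catalog.schema.asset
    SELECT order_id, amount FROM prod.customer_domain.orders;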

For a detailed understanding of how Unity Catalog provides enterprise governance on Databricks, download the UC Data Governance Architecture Patterns Ebook.

Databricks System Tables

Databricks provides built-in system tables that capture metadata for lakehouse observability and help ensure compliance. Examples include tables showing which users have access to which data objects; billing tables that provide pricing and usage; compute tables that cover cluster usage and warehouse events; and lineage information between columns and tables.

These underlying tables can be queried through SQL or activity dashboards to provide observability about every asset within the Databricks Intelligence Platform. 
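For instance, a billing query along these lines summarizes consumption per workspace (system.billing.usage is a documented system table; the aggregation here is illustrative):

    -- Daily DBU consumption per workspace over the last 30 days
    SELECT workspace_id,
           usage_date,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY workspace_id, usage_date
    ORDER BY usage_date DESC;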

Specific tables that help implement a Zero Trust architecture for your lakehouse include:

Audit tables:

  • Includes information on a wide variety of UC events. UC captures an audit log of actions performed against the metastore, giving administrators access to details about who accessed a given dataset and the actions they performed.

For example, a query along the following lines answers “which tables did a user access recently?” (the email address and action list below are placeholders):
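    -- Tables a given user touched in the last day; widen the
    -- action_name list to cover the events you care about
    SELECT action_name, event_time, request_params
    FROM system.access.audit
    WHERE user_identity.email = 'user@example.com'  -- placeholder user
      AND action_name IN ('getTable', 'createTable', 'deleteTable')
      AND event_date > current_date() - INTERVAL 1 DAY
    ORDER BY event_time DESC;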


Table lineage and column lineage tables: 

  • They allow you to programmatically query lineage data to fuel decision making and reports. Table lineage records each read-and-write event on a UC table or path, which might include job runs, notebook runs, and dashboards associated with the table. Column lineage data is captured by reading the column.
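A sketch of a lineage query, reusing the hypothetical table from earlier; see the table lineage system table reference for the full schema:

    -- Recent write events into a table
    -- (filter on source_table_full_name instead to see reads)
    SELECT entity_type, entity_id, created_by, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'prod.customer_domain.orders'
      AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC;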

Query history tables hold information on all SQL commands, I/O performance, and the number of rows returned.
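A final sketch against the query history table; verify the column names against your workspace’s system table reference before relying on them:

    -- Twenty slowest statements over the past week
    SELECT executed_by, statement_text, total_duration_ms
    FROM system.query.history
    WHERE start_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY total_duration_ms DESC
    LIMIT 20;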

Conclusion

Implementing Zero Trust Security within Databricks on Azure is an essential step toward protecting sensitive data and maintaining a secure environment in the cloud. By following Zero Trust principles like explicit verification, least-privilege access, and assumption of breach, organizations can significantly reduce the risk of unauthorized access. 

As a best practice, when implementing Azure Databricks, administrators should integrate with Azure Security Center (now Microsoft Defender for Cloud) to provide additional security monitoring, threat protection, and compliance tracking, and should continuously audit user activity.

By applying these practices, you’ll be well on your way to building a secure, Zero Trust-compliant Databricks environment on Azure that safeguards your data and resources effectively.