Data security has always been one of an organization’s top priorities, yet it is often one of the final conversations when designing an environment or evaluating a new platform. With an ever-increasing number of cyber threats in today’s world, security should be at the forefront of infrastructure planning, and keeping your cloud-based data platforms secure is essential. One of the fundamental frameworks for securing such environments is Zero Trust Security. This security model assumes that internal and external networks are equally untrusted and that all access requests must be verified, regardless of their origin.
In this guide, we’ll explore how to implement Zero Trust principles within Databricks on Azure, focusing on critical components like multi-factor authentication (MFA), conditional access, fine-grained access controls, and the enforcement of least-privilege access.
Today, organizations need a new security model that effectively adapts to the complexity of the modern environment, embraces the mobile workforce, and protects people, devices, applications, and data wherever they are located.
Zero Trust is a security model rooted in the principle of “never trust, always verify”. It assumes breach and verifies each request as though it originated from an uncontrolled network. Trust should never be implicit, and all access to systems, networks, and data must be continuously verified to ensure the right people are accessing the right levels of information at the right time.
Rather than assuming that users within a trusted network are safe, Zero Trust demands constant validation of all users and devices attempting to access resources, whether they are inside or outside the network perimeter.
The core principles of Zero Trust revolve around the following:
- Verify explicitly: always authenticate and authorize based on all available data points, such as user identity, location, device health, and anomalies.
- Use least-privilege access: limit users to just-in-time and just-enough access, granting only the permissions they need to do their jobs.
- Assume breach: operate as though an attacker is already inside, segment access to minimize blast radius, and monitor continuously.
By implementing these principles, organizations can ensure that only authorized users and devices can access sensitive data and resources, and any abnormal behavior is detected and flagged in real time.
As a cloud-first platform, Azure provides a robust set of tools and features that help enforce Zero Trust policies effectively. Databricks, as a first-party Azure product, can natively integrate with these services to provide comprehensive data protection by applying Zero Trust principles.
In this section, we will learn how to apply these principles and introduce the Databricks security controls that support them. For more information about Databricks’ comprehensive security, check out the Security and Trust Center.
Verify Explicitly:
One of the foundational elements of Zero Trust is verifying the identity of users and devices. Microsoft Entra ID provides a suite of tools to implement multi-factor authentication (MFA) and conditional access policies, which are essential for securing access to Databricks. Because Databricks sits within Azure’s ecosystem, end users can access it through single sign-on backed by Entra ID.
By combining MFA with conditional access, all backed by Microsoft Entra ID, you add another layer of security that aligns with the “verify explicitly” principle of Zero Trust.
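As an illustration, here is a minimal sketch (not an official script) that creates such a policy through the Microsoft Graph conditional access API, starting in report-only mode. It assumes you already hold a Graph token with the Policy.ReadWrite.ConditionalAccess permission; the application ID shown is the widely documented Azure Databricks first-party app ID, which you should verify in your own tenant.

```python
import requests

GRAPH_TOKEN = "<graph-access-token>"  # placeholder: acquire via MSAL or the Azure CLI

policy = {
    "displayName": "Require MFA for Azure Databricks",
    # Report-only lets you observe the impact before enforcing; switch to "enabled" when ready.
    "state": "enabledForReportingButNotEnforced",
    "conditions": {
        "clientAppTypes": ["all"],
        "applications": {
            # Assumed well-known Azure Databricks first-party application ID; verify in your tenant.
            "includeApplications": ["2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"]
        },
        "users": {"includeUsers": ["All"]},
    },
    "grantControls": {"operator": "OR", "builtInControls": ["mfa"]},
}

resp = requests.post(
    "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies",
    headers={"Authorization": f"Bearer {GRAPH_TOKEN}"},
    json=policy,
)
resp.raise_for_status()
```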
Once users have been authenticated, the next step in a Zero Trust approach is ensuring that they only have access to the resources they absolutely need. In Databricks, this can be achieved through fine-grained access control over various components like workspaces, clusters, and notebooks. In Azure Databricks, there are different access control systems for different securable objects.
Databricks Access Control Model
Configuring Workspace Permissions
Databricks provides role-based access control (RBAC) and fine-grained permissions for managing access to resources. By configuring these controls, you can implement Zero Trust by ensuring that only the appropriate individuals have access to critical resources. At the workspace level, assign roles such as "Admin," "User," and "Viewer". Then grant specific permissions based on tasks, such as the ability to create notebooks, run jobs, or manage clusters.
Cluster permissions in Databricks can be configured to ensure that only authorized users can spin up clusters or interact with them. This is particularly important in a Zero Trust environment to avoid the misuse of cloud resources. Define which users or groups can create, manage, or modify clusters through cluster ACLs (access control lists), and ensure that clusters run with instance profiles that enforce the minimum privilege needed for tasks like data processing.
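For illustration, here is a hedged sketch of a cluster ACL grant using the Databricks Permissions REST API; the workspace URL, token, cluster ID, and group names are placeholders, not values from this article.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
TOKEN = "<databricks-token>"
CLUSTER_ID = "<cluster-id>"

# PATCH merges these entries into the existing ACL rather than replacing it.
acl = {
    "access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_RESTART"},
        {"group_name": "platform-admins", "permission_level": "CAN_MANAGE"},
    ]
}

resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=acl,
)
resp.raise_for_status()
```

Restart-only access lets engineers use a pre-configured cluster without being able to resize it or change its configuration, keeping the grant aligned with least privilege.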
Notebooks often contain sensitive information, such as queries, data processing code, or even credentials. Fine-grained notebook permissions help restrict access. Assign permissions at the notebook level, ensuring that only authorized users can read or modify sensitive notebooks. Databricks also provides audit logs that track who accessed a notebook and when, helping maintain transparency and trace any potentially malicious activity.
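The same Permissions API pattern covers notebooks; a minimal sketch, again with placeholder identifiers, that grants read-only access:

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
TOKEN = "<databricks-token>"
NOTEBOOK_ID = "<notebook-id>"  # the numeric object ID visible in the notebook URL

resp = requests.patch(
    f"{HOST}/api/2.0/permissions/notebooks/{NOTEBOOK_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    # CAN_READ lets analysts view results without editing or re-running the notebook.
    json={"access_control_list": [
        {"group_name": "analysts", "permission_level": "CAN_READ"}
    ]},
)
resp.raise_for_status()
```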
Overview:
| Securable Object | Access Control System |
| --- | --- |
| Workspace-level securable objects | Access control lists (ACLs) |
| Account-level securable objects | Account role-based access control (RBAC) |
| Data securable objects | Unity Catalog |
By carefully configuring these access controls, you can create a secure environment where users only have access to the resources they need, significantly reducing the risk of unauthorized access.
For an overview of setting up and managing user permissions, go to Manage Users on Azure Databricks.
Securing Data via Segmentation
Segmenting levels of access via Microsoft Entra ID at the account, subscription, resource group, and resource levels for Databricks compute and data storage is a great first step. For making use of the data inside Databricks, we rely on Unity Catalog as the unified governance solution for data and AI assets: it enables secure access and sharing across clouds and platforms while providing a central place to manage permissions and audit data access across multiple workspaces.
The Databricks account is linked to the Microsoft Entra tenant ID. Any subscription inside the tenant is mapped to the same Databricks account. Consequently, any workspace in any subscription and resource group will be able to access the same metastore in the respective regions if admins allow this.
In Azure Databricks, a managed identity is used as a storage credential to authenticate with Unity Catalog when accessing ADLS Gen2. Administrators map storage credentials to specific storage accounts (like ADLS Gen2) using external locations, allowing secure access and easier management.
This simplifies administration by enabling users to manage access and data without needing deep cloud-specific knowledge.
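For illustration, here is a minimal sketch of this mapping in Unity Catalog SQL, run from a notebook where `spark` is available; the credential, location, group, and storage account names are hypothetical.

```python
# Map a managed-identity storage credential to an ADLS Gen2 container.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS finance_landing
    URL 'abfss://landing@contosodatalake.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL finance_mi_credential)
""")

# Grant read-only file access, in keeping with least privilege.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION finance_landing TO `data-engineers`")
```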
Unity Catalog further segments protection of data assets through a three-level namespace that can be used to represent a logical layout of how data is accessible and managed in your organization. The three levels are catalogs, schemas and assets.
Catalogs represent the top level of the structure. They can be made available for selective workspaces, which allows different operating units to enforce that their data is only available within their environments. This also allows teams to enforce the availability of data across software development lifecycle scopes (e.g., prod data in prod, dev data in dev).
Schemas are defined inside catalogs and can serve as a grouping for per-domain data assets. Assets sit inside schemas and include the following objects:
- Tables and views
- Volumes for governed file storage
- Functions
- Machine learning models
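As a sketch of least-privilege grants across this three-level namespace (the catalog, schema, table, and group names are hypothetical), note that a reader needs all three grants to query the table:

```python
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `analysts`")        # enter the catalog
spark.sql("GRANT USE SCHEMA ON SCHEMA prod.sales TO `analysts`")    # enter one schema
spark.sql("GRANT SELECT ON TABLE prod.sales.orders TO `analysts`")  # read one table
```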
For a detailed understanding of how Unity Catalog provides enterprise governance on Databricks, download the UC Data Governance Architecture Patterns Ebook.
Databricks System Tables
Databricks provides built-in system tables that capture metadata for lakehouse observability and compliance. Examples include access tables that record which users have access to which data objects; billing tables that provide pricing and usage; compute tables that cover cluster usage and warehouse events; and lineage information between columns and tables.
These underlying tables can be queried through SQL or activity dashboards to provide observability about every asset within the Databricks Intelligence Platform.
Specific tables that help implement a Zero Trust architecture for your lakehouse include:
Audit tables: the system.access.audit table records audited actions, including data access events, across the workspaces in your account.
For example, to query which tables a user accessed recently:
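A sketch against system.access.audit; the user email is a placeholder, and the action names follow the Unity Catalog audit events documented at the time of writing:

```python
recent = spark.sql("""
    SELECT event_time, action_name, request_params.full_name_arg AS table_name
    FROM system.access.audit
    WHERE user_identity.email = 'jane.doe@contoso.com'   -- placeholder user
      AND service_name = 'unityCatalog'
      AND action_name IN ('getTable', 'createTable', 'deleteTable')
      AND event_date >= date_sub(current_date(), 7)
    ORDER BY event_time DESC
""")
display(recent)
```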
Table lineage and column lineage tables: system.access.table_lineage and system.access.column_lineage capture read and write relationships between tables and between individual columns.
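For instance, here is a sketch that lists the downstream consumers of a sensitive table; the table name is hypothetical:

```python
downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name, entity_type, created_by
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'prod.sales.orders'   -- hypothetical table
      AND target_table_full_name IS NOT NULL
""")
display(downstream)
```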
Query history: the system.query.history table holds information on SQL statements, their I/O performance, and the number of rows returned.
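A sketch of a simple review query against it; the column names follow the documented schema at the time of writing, so verify them in your workspace:

```python
slow_queries = spark.sql("""
    SELECT executed_by, statement_text, total_duration_ms, produced_rows
    FROM system.query.history
    WHERE start_time >= current_timestamp() - INTERVAL 1 DAY
    ORDER BY total_duration_ms DESC
    LIMIT 20
""")
display(slow_queries)
```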
Implementing Zero Trust Security within Databricks on Azure is an essential step toward protecting sensitive data and maintaining a secure environment in the cloud. By following Zero Trust principles like explicit verification, least-privilege access, and assumption of breach, organizations can significantly reduce the risk of unauthorized access.
As a best practice, when implementing Azure Databricks, administrators should integrate with Microsoft Defender for Cloud (formerly Azure Security Center) for additional security monitoring, threat protection, and compliance tracking, and should continuously audit user activity.
By applying these practices, you’ll be well on your way to building a secure, Zero Trust-compliant Databricks environment on Azure that safeguards your data and resources effectively.