Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Shu_Li
Databricks Employee

Databricks has deprecated IAM Passthrough and Table ACLs with the release of Databricks Runtime (DBR) 15.0. This transition marks a shift towards a unified, secure, and efficient approach to Data and AI governance with Unity Catalog.

Unity Catalog offers a centralized platform that simplifies the management of Data and AI assets across Databricks workspaces, providing enhanced capabilities in access control, auditing, lineage, and data discovery.

Let’s dive into what Table ACLs and IAM Passthrough are and how Unity Catalog addresses the limitations of these systems.

What are Table Access Control Lists (TACLs) - Legacy

Databricks HMS (Hive Metastore) table ACL (Access Control List) is a legacy data governance feature that allows control over access to database objects in the Hive metastore.

Before we get into the details of how Table ACLs work, let’s first look at the differences between the built-in HMS and an external HMS.

  1. Built-in Hive Metastore (HMS)
    1. A Databricks workspace deploys with a built-in HMS as a managed service. By default, all users have access to all data managed by the built-in HMS.
    2. It may not support certain naming or hierarchical storage conventions that your organization uses.
    3. Governance is limited to a single workspace, which may be insufficient for organizations requiring cross-workspace data management.
  2. External HMS
    1. An external HMS is a configuration where Databricks clusters link to an existing HMS not managed by Databricks. External HMS typically relies on Apache Ranger or similar tools for access control, while built-in HMS uses Databricks' native Table Access Control feature.
    2. When integrated with specialized security tools, such as Apache Ranger, external HMS can potentially provide more fine-grained access control, including column-level and row-level security, whereas built-in HMS Table Access Control is primarily focused on table-level permissions.
    3. Built-in HMS integrates seamlessly with Databricks' ecosystem, while external HMS may require additional configuration to work with cloud storage services like Amazon S3 or Azure Data Lake Storage.
    4. Flexibility: It allows you to enforce specific naming or hierarchical storage conventions.
    5. It can be shared across multiple workspaces.
    6. Performance Considerations: Depending on the setup, an external HMS might have different performance characteristics, especially for metadata-heavy operations.

How do TACLs Work? 

  1. TACL allows setting table-level privileges on Hive metastore objects.
  2. It's disabled by default for Data Science and Machine Learning clusters but enabled by default for Databricks SQL endpoints.
  3. Workspace administrators must enable TACL for the entire workspace before it can be used on individual clusters.
  4. For non-DBSQL clusters, TACL needs to be explicitly enabled in the cluster configuration.
  5. To ensure security, it's recommended to restrict cluster creation to admin users or limit users to SQL-only access.
  6. Note that Databricks workspace administrators retain file-level data access even when TACL is enabled.
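
To make the table-level grants above more concrete, here is a minimal sketch that builds legacy Table ACL statements of the form GRANT <privilege> ON TABLE <table> TO <principal>. The helper name build_grant, the table sales.orders, and the principals are hypothetical, and only a small subset of the legacy privileges is covered. On a TACL-enabled cluster, each resulting string would typically be passed to spark.sql(...).

```python
# Hypothetical helper: builds legacy Table ACL statements as plain strings.
# Nothing here talks to Databricks; on a TACL-enabled cluster each string
# would typically be passed to spark.sql(...).

# Subset of legacy Hive metastore privileges handled by this sketch.
ALLOWED_PRIVILEGES = {"SELECT", "MODIFY", "READ_METADATA"}


def build_grant(privilege: str, table: str, principal: str, revoke: bool = False) -> str:
    """Return a GRANT (or REVOKE) statement for one table and one principal."""
    privilege = privilege.upper()
    if privilege not in ALLOWED_PRIVILEGES:
        raise ValueError(f"unsupported privilege in this sketch: {privilege}")
    verb, preposition = ("REVOKE", "FROM") if revoke else ("GRANT", "TO")
    # Principals (users or groups) are quoted with backticks.
    return f"{verb} {privilege} ON TABLE {table} {preposition} `{principal}`"


statements = [
    build_grant("select", "sales.orders", "analysts"),
    build_grant("modify", "sales.orders", "etl_user@example.com"),
    build_grant("select", "sales.orders", "analysts", revoke=True),
]

for stmt in statements:
    print(stmt)
```

Keeping the statements as data makes it easy to review or version them before they are applied, which matters because TACL permissions only take effect on clusters where the feature is enabled.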

Pros and Cons of Using TACLs

Advantages:

  1. When TACL is enabled, administrators can programmatically grant and revoke access to tables and views using Python and SQL. There are two modes: SQL-only, which restricts users to SQL commands, and Python and SQL, which allows broader access.
  2. Provides table-level access control, allowing administrators to grant or revoke permissions programmatically.

Disadvantages:

  1. It lacks fine-grained access control capabilities such as column-level or row-level security, and provides very limited auditing capabilities compared to more advanced solutions.
  2. TACL must be enabled on a per-cluster or per-workspace (DBSQL) basis to restrict access.
  3. It lacks advanced features like data lineage, centralized policy management, and automated data discovery.
  4. It may have limitations for large-scale enterprise deployments with complex governance requirements.

Why Does Databricks Deprecate TACLs?

Databricks has deprecated Table Access Control Lists (TACLs) for both the built-in and external HMS in favor of Unity Catalog for several reasons:

  1. Security Model: Unity Catalog is a secure-by-default feature. Clusters that can access Unity Catalog are guaranteed to be access controlled. Databricks disallows Unity Catalog access from clusters which are not configured to be secure. This is different from Table ACLs where, if a cluster doesn’t have Table ACL enabled, it can access everything without access control.
  2. Access Control: Unity Catalog provides a more granular level of access control. In order to read data from a table or view, a user must have SELECT on the table or view, USE SCHEMA on the schema that owns the table, and USE CATALOG on the catalog that owns the schema. This allows schema and catalog owners to limit how far individual table owners can share data they produce.
  3. Workspace Boundaries: Unity Catalog provides features like Catalog Workspace Bindings and Read Only Bindings that allow you to control access to catalogs from certain workspaces and ensure certain catalogs cannot be accessed together if you have specific requirements to control the combination of data.
  4. Integration with External Access Control Solutions: Unity Catalog supports external access control solutions like Immuta and Privacera. These tools synchronize their policies with Unity Catalog via REST APIs, and Unity Catalog enforces the policies at query time.
  5. Scalability: Table ACLs are not a scalable governance solution because of their performance limitations, management complexity, and inability to adapt to dynamic access requirements in large-scale data environments.
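
The three-privilege rule from the Access Control item can be sketched as a small check. This is a conceptual model only: Unity Catalog enforces the rule on the platform side, and the function can_read_table, the grants dictionary, and the names main.sales.orders are hypothetical.

```python
# Conceptual model only: Unity Catalog enforces this check itself.
# The data structure and all names below are hypothetical.

from typing import Dict, Set

# Maps a securable's full name to the privileges one user holds on it.
Grants = Dict[str, Set[str]]


def can_read_table(grants: Grants, table_full_name: str) -> bool:
    """Return True only if all three privileges named in the text are held."""
    catalog, schema, _table = table_full_name.split(".")
    return (
        "USE CATALOG" in grants.get(catalog, set())
        and "USE SCHEMA" in grants.get(f"{catalog}.{schema}", set())
        and "SELECT" in grants.get(table_full_name, set())
    )


analyst_grants: Grants = {
    "main": {"USE CATALOG"},
    "main.sales": {"USE SCHEMA"},
    "main.sales.orders": {"SELECT"},
}

# The table owner granted SELECT, but the schema owner never granted USE SCHEMA.
contractor_grants: Grants = {
    "main": {"USE CATALOG"},
    "main.sales.orders": {"SELECT"},
}

print(can_read_table(analyst_grants, "main.sales.orders"))     # all three held
print(can_read_table(contractor_grants, "main.sales.orders"))  # USE SCHEMA missing
```

The second case shows why the layered model limits how far a table owner can share data: a SELECT grant alone is not enough without the schema-level and catalog-level privileges.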

What is IAM Passthrough?

The TL;DR version: IAM Passthrough is ACLs on files, not on tables as with Table ACLs.

IAM Passthrough allows you to authenticate automatically to S3 buckets from Databricks clusters using the identity that you use to log in to Databricks. When it is enabled on a cluster, commands that you run on that cluster can read and write data in S3 using your identity. IAM Passthrough passes cloud provider tokens to clusters for data access. Its governance model allows multiple users with different data access policies to share one Databricks cluster to access data in S3 while maintaining data security.

[Figure: IAM Passthrough]

Why Does Databricks Deprecate IAM Passthrough?

  1. Coarse-Grained Access Control: IAM Passthrough only supports coarse-grained access control on files.
  2. No Table Security: IAM Passthrough does not provide secure access to tables.
  3. Limited Workflow Support: For Job and JDBC workflows, there is no Identity Provider (IdP) involved, so credentials can't be obtained through SAML. Databricks has built a separate workflow to support these use cases, but it may not cover all scenarios.
  4. Security Concerns: On a regular Databricks cluster with an instance profile attached, users have full root control inside the container. The instance profile’s role credential is accessible within the container, which could potentially lead to security issues.
  5. Frequent RPC Calls: After the driver or an executor gets the credentials, they are stored in a local cache (IAMCredentialCache) to avoid overly frequent remote procedure calls (RPCs) to the control plane. However, when a cached entry expires, the driver or executor must ask the data daemon for a new token, which can lead to performance issues.
  6. Limited Language Support: On shared clusters, only Python and SQL are supported with IAM Passthrough.
  7. Role Assumption: If a user does not explicitly assume a role, the cluster uses the first role in the list. This could potentially lead to access issues if the first role does not have the necessary permissions.
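
To illustrate the caching behavior described in the Frequent RPC Calls item, here is a minimal sketch of a time-bounded credential cache. It is not the Databricks IAMCredentialCache implementation: the class CredentialCache, the 60-second TTL, the fetch callback, and the token format are all assumptions made for illustration. The point is only that every expired entry forces another remote call.

```python
# Hypothetical sketch of a time-bounded credential cache. It is NOT the
# Databricks IAMCredentialCache; names, TTL, and token format are assumptions.

from typing import Callable, Dict, Tuple


class CredentialCache:
    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float]) -> None:
        self._fetch = fetch      # stands in for the remote call
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the example is deterministic
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.remote_calls = 0

    def get(self, user: str) -> str:
        now = self._clock()
        cached = self._entries.get(user)
        if cached is not None and now < cached[1]:
            return cached[0]     # cache hit: no remote call
        self.remote_calls += 1   # miss or expired entry: one remote call
        token = self._fetch(user)
        self._entries[user] = (token, now + self._ttl)
        return token


fake_time = [0.0]
cache = CredentialCache(
    fetch=lambda user: f"token-for-{user}",
    ttl_seconds=60.0,
    clock=lambda: fake_time[0],
)

cache.get("alice")    # first access: remote call
cache.get("alice")    # served from the cache
fake_time[0] = 61.0   # the cached entry has now expired
cache.get("alice")    # expired: a second remote call
print(cache.remote_calls)
```

With many users sharing one cluster and short-lived tokens, these refreshes add up, which is the performance concern the list item raises.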

What is Unity Catalog and How Does It Address the Limitations of IAM Passthrough and Table ACLs?

Unity Catalog is a comprehensive data governance solution designed to manage data and AI assets within the Databricks ecosystem. It provides a centralized platform for access control, auditing, data lineage, and data discovery across Databricks workspaces.

[Figure: How a query works with Unity Catalog]

Unity Catalog provides several benefits for managing both data and AI assets compared to Table Access Control Lists (ACLs) in HMS.

  1. Unified Visibility: Unity Catalog allows you to discover and classify structured and unstructured data, files, notebooks, ML models, and AI tools in one place. This unified visibility is beneficial for managing both data and AI assets as it provides a comprehensive view of your resources.
  2. Cross-Platform Data Access and Sharing: Unity Catalog enables seamless management, governance, and querying of data from external databases, data warehouses, and catalogs. This open data sharing helps unlock collaboration and monetize the value of data.
  3. AI-Powered Monitoring and Observability: Unity Catalog offers AI-powered monitoring and observability, which minimizes compliance risks. This feature is particularly useful for AI assets as it ensures that they are used in compliance with regulations.
  4. Fine-Grained Access Control: As mentioned earlier, Unity Catalog provides permissions on tables, rows, and columns, as well as on ML models, features, and reports. This fine-grained access control is beneficial for managing AI assets as it allows for precise control over who can access what resources.
  5. Security and Compliance: Unity Catalog helps in meeting regulatory compliance by providing detailed audit logs and ensuring that data and AI assets are secure and meet compliance requirements across platforms.
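
As a rough illustration of what row-level and column-level control means in the Fine-Grained Access Control item, the sketch below applies a row filter and a column mask to in-memory data. It is purely conceptual: in Unity Catalog these policies are defined on the table and enforced by the platform, not in client code, and the groups, columns, and function visible_rows are hypothetical.

```python
# Conceptual illustration only: in Unity Catalog, row filters and column masks
# are defined on the table and enforced by the platform, not in client code.
# All names and data below are hypothetical.

from typing import Dict, List, Set

Row = Dict[str, str]

ROWS: List[Row] = [
    {"region": "EU", "customer": "Acme", "email": "ops@acme.example"},
    {"region": "US", "customer": "Globex", "email": "it@globex.example"},
]


def visible_rows(rows: List[Row], groups: Set[str]) -> List[Row]:
    """Apply a row filter (by region) and a column mask (on email)."""
    result = []
    for row in rows:
        # Row filter: admins see everything, others only their own region.
        if "admins" not in groups and row["region"].lower() not in groups:
            continue
        visible = dict(row)  # copy so the source rows are never modified
        # Column mask: hide e-mail addresses from anyone outside pii_readers.
        if "pii_readers" not in groups:
            visible["email"] = "***"
        result.append(visible)
    return result


print(visible_rows(ROWS, {"eu"}))
print(visible_rows(ROWS, {"admins", "pii_readers"}))
```

A member of the eu group sees only the EU row with the e-mail masked, while a user in both admins and pii_readers sees every row unmasked. Table-level ACLs cannot express either distinction.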

Conclusion

In conclusion, embracing Unity Catalog future-proofs your Data and AI governance landscape. As Databricks phases out IAM Passthrough and Table ACL, organizations that proactively adopt Unity Catalog will be better positioned to innovate and maintain a competitive edge in their Data and AI journey.
