Data Access Control in Databricks

isaac_gritz
Valued Contributor II

Best Practices for Securing Access to Data in Databricks

Unity Catalog is the unified governance solution for Data & AI assets in Databricks and greatly simplifies and centralizes data access control. This guide covers best practices both for the streamlined approach with Unity Catalog and for the legacy approach without it.

Data Access Control with Unity Catalog

Unity Catalog elevates access control for files, databases, tables, rows, columns, and more from the cluster level to the metastore level, and lets you manage users, groups, and permissions consistently across workspaces.
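
Because permissions live in the metastore, a single grant takes effect in every workspace attached to that metastore. A minimal sketch (the catalog, schema, table, and group names below are hypothetical; the full privilege chain is shown in a later reply):

```sql
-- Run once, from any attached workspace: `analysts` can now read this
-- table from every workspace attached to the metastore.
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
```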

Continued below


isaac_gritz
Valued Contributor II
  1. To enable a workspace for Unity Catalog:
    1. Create an S3 bucket and IAM role (AWS | GCP) or Access Connector (Azure) that Unity Catalog will use as the default storage for managed tables (AWS | Azure | GCP).
    2. Create a metastore using that IAM role (AWS | GCP) or Access Connector (Azure) and attach the metastore to each workspace that should have access to it.
  2. For securing access to buckets, folders, and blobs in S3/ADLS/GCS:
    1. For access to data in the default S3/ADLS/GCS bucket/container:
      1. A Managed Storage Credential (AWS | Azure | GCP) was automatically created when the metastore was set up.
      2. Create an External Location (AWS | Azure | GCP) using that Managed Storage Credential to scope access down to the specific storage path within that bucket/container you want to grant access to.
      3. Grant access to that External Location to the groups that should be able to read, write, or create tables on top of those S3/ADLS/GCS locations (AWS | Azure | GCP); a SQL sketch follows this list.
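
A minimal SQL sketch of steps 2–3 above (the location name, path, credential name, and group are hypothetical placeholders; the credential is the managed one created with the metastore):

```sql
-- Scope access down to one path inside the metastore's default bucket.
CREATE EXTERNAL LOCATION IF NOT EXISTS finance_landing
  URL 's3://my-metastore-bucket/landing/finance'
  WITH (STORAGE CREDENTIAL metastore_default_credential);

-- Let the group read/write files and build external tables on that path.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION finance_landing TO `data-engineers`;
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION finance_landing TO `data-engineers`;
```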

Continued Below

isaac_gritz
Valued Contributor II
  1. To enable a workspace for Unity Catalog: (see above)
  2. For securing access to buckets, folders, and blobs in S3/ADLS/GCS: (see above)
    1. For access to data in the default S3/ADLS/GCS bucket/container: (see above)
    2. For access to data in external S3/ADLS/GCS buckets/containers:
      1. Create an IAM role (AWS | GCP) or Managed Identity (Azure) to provide access to this S3/ADLS/GCS bucket/container.
      2. Create a Storage Credential with that IAM role (AWS | GCP) or Managed Identity (Azure).
      3. Create an External Location (AWS | Azure | GCP) using that Storage Credential to scope access down to the specific storage path within that bucket/container you want to grant access to.
      4. Grant access to that External Location to the groups that should be able to read, write, or create tables on top of those S3/ADLS/GCS locations (AWS | Azure | GCP).
  3. For databases and tables:
    1. Use the UI or SQL to grant/revoke access to databases and tables (AWS | Azure | GCP); see the SQL sketch after this list.
  4. Enable clusters and SQL warehouses to leverage Unity Catalog:
    1. Enable Shared (SQL, Python) or Single User (R, Scala) security mode on DS&E clusters (AWS | Azure | GCP).
    2. Databricks SQL warehouses are enabled for Unity Catalog by default.
  5. Fine-grained access control:
    1. Row- and column-level security and dynamic data masking can be administered using Dynamic View Functions (AWS | Azure | GCP); the sketch after this list includes an example.
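
A minimal SQL sketch of the grants in step 3 and the Dynamic View Functions in step 5 (catalog, schema, table, column, and group names are hypothetical):

```sql
-- Step 3: the full privilege chain a group needs to query one table.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
REVOKE SELECT ON TABLE main.sales.orders FROM `contractors`;

-- Step 5: a dynamic view that masks a column and filters rows by group.
CREATE OR REPLACE VIEW main.sales.orders_redacted AS
SELECT
  order_id,
  -- Only auditors see the raw email address.
  CASE WHEN is_account_group_member('auditors')
       THEN customer_email ELSE '***REDACTED***' END AS customer_email,
  amount,
  region
FROM main.sales.orders
-- Managers see every row; everyone else sees only the US region.
WHERE is_account_group_member('managers') OR region = 'US';
```

Grant SELECT on the view (not the underlying table) to the broader audience so the masking cannot be bypassed.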

Continued Below

isaac_gritz
Valued Contributor II

Data Access Control without Unity Catalog

Prior to Unity Catalog, data access was controlled at the cluster level using Table Access Controls.

  1. For securing access to buckets, folders, and blobs in S3/ADLS/GCS:
    1. Create an IAM role and instance profile (AWS) with access to the AWS S3 buckets/folders you want to grant to a team, create a Service Principal for access to ADLS Gen2 containers/blobs (Azure), or use a Service Account to connect to a GCS bucket (GCP).
    2. Attach the instance profile to the DS&E cluster (AWS), mount the ADLS Gen2 container to the workspace using the Service Principal (Azure), or add the GCP Service Account email to the DS&E cluster (GCP).
    3. Use cluster entitlements (AWS | Azure | GCP) to turn off unrestricted cluster creation for DS&E groups.
    4. Provide access to that cluster or cluster policy using Cluster ACLs (AWS | Azure | GCP).

Continued Below

isaac_gritz
Valued Contributor II
  1. For securing access to buckets, folders, and blobs in S3/ADLS/GCS: (see above)
  2. For database, tables:
    1. Use cluster entitlements (AWS | Azure | GCP) to turn off unrestricted cluster creation for groups, or restrict them to Databricks SQL only.
      1. For using SQL/Python within notebooks while restricting access to databases/tables:
        1. Create a cluster that has Shared Access mode (AWS | Azure | GCP) enabled.
        2. Provide access to that cluster or policy using Cluster ACLs (AWS | Azure | GCP).
        3. Use SQL GRANT statements (AWS | Azure | GCP) to grant/revoke permissions; see the sketch after this list.
      2. For using Databricks SQL while restricting access to databases/tables:
        1. Databricks SQL warehouses automatically have Shared Access mode enabled.
        2. Use the Databricks SQL UI or SQL (AWS | Azure | GCP) to grant/revoke access to databases and tables.
  3. Fine-grained access control:
    1. Row- and column-level security and dynamic data masking can be administered using Dynamic View Functions (AWS | Azure | GCP); the sketch after this list includes an example.
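
A minimal sketch of the legacy equivalents, using hive_metastore table ACLs and a workspace-local group check (database, table, column, and group names are hypothetical; on Unity Catalog use is_account_group_member() instead of is_member()):

```sql
-- Legacy table ACLs: USAGE on the database plus SELECT on the table.
GRANT USAGE ON DATABASE sales TO `analysts`;
GRANT SELECT ON TABLE sales.orders TO `analysts`;

-- Fine-grained control via a dynamic view, as in step 3 above.
CREATE OR REPLACE VIEW sales.orders_redacted AS
SELECT
  order_id,
  CASE WHEN is_member('auditors')
       THEN customer_email ELSE '***REDACTED***' END AS customer_email,
  amount,
  region
FROM sales.orders
WHERE is_member('managers') OR region = 'US';
```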

isaac_gritz
Valued Contributor II

Let us know if this walkthrough helped you set up data access control, and tell us how your journey to Unity Catalog is going!
