Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
stevejohansen
Databricks Employee

Introduction

Part one of this article introduced automatic enablement of Unity Catalog (UC) on new Azure Databricks workspaces and detailed exactly what happens during that process. This article takes a deep dive into the same Unity Catalog Workspace Automatic Enablement process for Databricks workspaces on AWS and highlights the differences.

As with the previous article, we assume the reader understands what Unity Catalog is and how to set it up on AWS, along with how its securables (data objects) and permissions are managed.

What is a Workspace Catalog?

When a workspace is created with Unity Catalog automatically enabled, a workspace catalog is created and assigned to that workspace. If there is no existing Unity Catalog metastore in the cloud region where the workspace is being created, a metastore will also be created with the defaults described below. To the Workspace Administrator and Workspace User groups this should look almost identical to how it is done on Azure, with the underlying cloud infrastructure details mostly abstracted away.

uc-by-default-aws-overview.png

The workspace catalog on AWS has the following properties:

  • Its name will match the workspace name
  • It will be owned by a system owned group called _workspace_admins_${workspace_name}_${workspace_id}
  • Its storage root will be located in the S3 workspace root bucket in a dedicated folder called unity-catalog
  • A system owned group called _workspace_users_${workspace_name}_${workspace_id} will have USE_CATALOG rights on it. This group also has enough rights to create objects in the default schema of the workspace catalog.

The workspace catalog is made up of three Unity Catalog securables:

  • Credential: the biggest difference on AWS is that an extra IAM role needs to be provisioned for the UC storage credential used by the workspace catalog (details below). The name of the UC credential will match the workspace name.
  • External location: this adds the unity-catalog folder in the S3 workspace root bucket as a valid path in Unity Catalog. The name of this external location is also the workspace name. The path is s3://${workspace_root}/unity-catalog/${workspace_id}
  • Catalog: this is the workspace catalog, which has a storage root pointing to the unity-catalog folder on the external location. The name of this catalog is also the workspace name.

All three of these UC securables are bound to the workspace and are not, by default, available to any other workspace sharing the metastore.
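The naming conventions above can be sketched in a few lines of Python. This is only an illustration of the patterns described in this article: the function and class names are hypothetical, the workspace name is used verbatim, and any character normalisation Databricks may apply to names is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceCatalogNames:
    """Names of the securables and groups tied to a workspace catalog."""
    credential: str
    external_location: str
    catalog: str
    storage_path: str
    admins_group: str
    users_group: str


def workspace_catalog_names(
    workspace_name: str, workspace_id: str, workspace_root: str
) -> WorkspaceCatalogNames:
    """Derive the names described above (hypothetical helper, not a Databricks API)."""
    return WorkspaceCatalogNames(
        # Credential, external location and catalog all share the workspace name.
        credential=workspace_name,
        external_location=workspace_name,
        catalog=workspace_name,
        storage_path=f"s3://{workspace_root}/unity-catalog/{workspace_id}",
        admins_group=f"_workspace_admins_{workspace_name}_{workspace_id}",
        users_group=f"_workspace_users_{workspace_name}_{workspace_id}",
    )


# Placeholder values, not taken from a real deployment.
names = workspace_catalog_names("my_workspace", "1234567890", "my-root-bucket")
print(names.storage_path)
print(names.admins_group)
```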

How Automatic Workspace Assignment Works

Automatically enabling a workspace for Unity Catalog on AWS has two main prerequisites:

  • If a metastore exists it must be enabled for automatic assignment
  • A UC storage credential IAM role allowing access to the S3 workspace root bucket must be created

The first requirement is identical to enabling automatic assignment on Azure Databricks. As shown in part one of this article, only Databricks Accounts created after 9 November 2023 are automatically set up for Automatic Workspace Assignment.

uc-by-default-flow-aws.png

When creating a workspace in a region where Automatic Workspace Assignment is enabled on the Account but there is no metastore, a metastore will be created for you in the same way as it is on Azure. The properties of this metastore are:

  • The metastore will be called metastore_aws_${cloud_region}
  • The metastore will have no metastore owner (it will show System user)
  • The metastore will be created without a storage root location
  • Delta sharing will be disabled
  • Automatic Workspace Assignment will be enabled

If required, an Account Administrator can assign a Metastore Owner.
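As a quick reference, the defaults listed above can be captured in a small Python data class. This is a sketch only: the class and function names are hypothetical, and the region string is inserted into the name verbatim because the exact format of ${cloud_region} is not spelled out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DefaultMetastore:
    """Properties of an automatically created metastore, as listed above."""
    name: str
    owner: str = "System user"            # no metastore owner is set
    storage_root: Optional[str] = None    # created without a storage root
    delta_sharing_enabled: bool = False
    auto_workspace_assignment: bool = True


def default_metastore(cloud_region: str) -> DefaultMetastore:
    """Build the default description for a region (hypothetical helper)."""
    # Assumption: the region is used as given, with no reformatting.
    return DefaultMetastore(name=f"metastore_aws_{cloud_region}")


metastore = default_metastore("example_region")
print(metastore)
```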

In order to automatically enable all new workspaces in a region for Unity Catalog on an existing metastore in that region the checkbox in Workspace assignment under the metastore settings in the Catalog section of the Account Console has to be checked.

uc-by-default-aws-metastore-tick.png

The second prerequisite is specific to Databricks on AWS. The documentation states:

Your workspace gets the workspace catalog only if the workspace creator provided an appropriate IAM role and storage location during workspace provisioning.

We will go into detail on how this IAM role works in the section on AWS infrastructure below.

As on Azure, when a metastore is assigned to a workspace a default catalog name is set for all users of that workspace. If the workspace is created via the UI and automatically enabled for UC, the default catalog will be the workspace catalog. If the workspace is created via an API (currently supported via the Quickstart CloudFormation template), the default catalog will be the hive_metastore.
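The decisions described in this section can be summarised as a small piece of Python. This is our reading of the flow rather than how Databricks implements it, and all names are hypothetical. Two cases the text does not cover are filled with labelled assumptions: a UI-created workspace without a workspace catalog falls back to hive_metastore, and a workspace that is not enabled for UC has no default catalog modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnablementPlan:
    uc_enabled: bool
    create_metastore: bool
    create_workspace_catalog: bool
    default_catalog: Optional[str]


def plan_enablement(
    *,
    metastore_exists: bool,
    metastore_auto_assign: bool,
    account_auto_assign: bool,
    uc_role_and_storage_provided: bool,
    created_via_ui: bool,
    workspace_name: str,
) -> EnablementPlan:
    """Sketch of the automatic enablement decisions for a new AWS workspace."""
    if metastore_exists:
        # An existing metastore must have automatic assignment switched on.
        uc_enabled = metastore_auto_assign
        create_metastore = False
    else:
        # No metastore in the region: one is created if the account allows it.
        uc_enabled = account_auto_assign
        create_metastore = account_auto_assign

    # The workspace catalog needs the UC IAM role and storage location.
    create_catalog = uc_enabled and uc_role_and_storage_provided

    if not uc_enabled:
        default_catalog = None  # assumption: not modelled in this sketch
    elif created_via_ui and create_catalog:
        default_catalog = workspace_name
    else:
        # API-created workspaces; also our assumed fallback without a catalog.
        default_catalog = "hive_metastore"

    return EnablementPlan(uc_enabled, create_metastore, create_catalog, default_catalog)


ui_plan = plan_enablement(
    metastore_exists=False, metastore_auto_assign=False, account_auto_assign=True,
    uc_role_and_storage_provided=True, created_via_ui=True, workspace_name="my_workspace",
)
api_plan = plan_enablement(
    metastore_exists=True, metastore_auto_assign=True, account_auto_assign=True,
    uc_role_and_storage_provided=True, created_via_ui=False, workspace_name="my_workspace",
)
print(ui_plan)
print(api_plan)
```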

AWS infrastructure deployed during automatic enablement

When provisioning a Databricks Workspace on AWS there have always been the following required items of supporting AWS infrastructure:

  • S3 workspace root bucket: This is an S3 bucket used for workspace storage, including DBFS and the workspace filesystem.
  • IAM cross account role: This allows Databricks to launch EC2 instances for cluster nodes in the customer compute plane AWS account. The S3 workspace root bucket includes a bucket policy to allow this cross account access.

When setting up a workspace for automatic UC enablement we also need to provide another IAM role to be used as the UC storage credential for the workspace catalog storage. This role allows UC to access the unity-catalog folder in the S3 workspace root and to put the workspace catalog storage root in that folder.

uc-by-default-aws-infrapng.png

This storage credential IAM role is a standard Unity Catalog self-assuming IAM role, as outlined in the documentation for creating storage credentials in UC. The IAM policy attached to that role grants access to the unity-catalog folder in the workspace root bucket. The trust policy, with its self-assuming setup, is the same as for any other UC storage credential IAM role. This Medium post covers the details.

{
    "Version": "2012-10-17",
    "Id": "databricks-uc-dbfs-bucket-access",
    "Statement": [
        {
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::anzps-uc-by-default",
                "arn:aws:s3:::anzps-uc-by-default/unity-catalog/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "sts:AssumeRole"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:iam::332745928618:role/anzps-uc-by-default-workspace-cred"
            ]
        }
    ]
}
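If you provision this role with your own tooling, the policy above can be templated. The following Python sketch mirrors the structure of the policy shown above for a given bucket and role ARN. The function name is hypothetical, the bucket and account ID in the usage lines are placeholders, and the trust policy is not covered here.

```python
import json


def uc_storage_credential_policy(bucket: str, role_arn: str) -> dict:
    """Build a policy document shaped like the example above (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Id": "databricks-uc-dbfs-bucket-access",
        "Statement": [
            {
                # Object access is limited to the unity-catalog folder.
                "Action": [
                    "s3:DeleteObject",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/unity-catalog/*",
                ],
                "Effect": "Allow",
            },
            {
                # Self-assume: the role is allowed to assume itself.
                "Action": ["sts:AssumeRole"],
                "Effect": "Allow",
                "Resource": [role_arn],
            },
        ],
    }


# Placeholder values, not a real bucket or account.
policy = uc_storage_credential_policy(
    "my-root-bucket",
    "arn:aws:iam::123456789012:role/my-workspace-cred",
)
print(json.dumps(policy, indent=4))
```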

When provisioning the workspace there are now steps in the S3 workspace storage provisioning to provide the ARN of your Unity Catalog storage credential IAM role. You can also see that the generated bucket policy explicitly denies the workspace cross account role, which is used to access the workspace root bucket for compute and DBFS, any access to the unity-catalog folder. This ensures all access to that folder must go via Unity Catalog.
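To confirm that a bucket policy contains such a deny rule, a simple check can be scripted. The sketch below is a minimal example in Python: the function name is hypothetical, the sample policy is made up for illustration (it is not the policy Databricks generates), and the check only looks for a Deny statement whose resources include the unity-catalog prefix, without evaluating principals or actions.

```python
def _as_list(value):
    """IAM JSON allows a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def denies_unity_catalog_prefix(bucket_policy: dict, bucket: str) -> bool:
    """Return True if any Deny statement targets <bucket>/unity-catalog/*."""
    target = f"arn:aws:s3:::{bucket}/unity-catalog/*"
    for statement in _as_list(bucket_policy.get("Statement", [])):
        if statement.get("Effect") != "Deny":
            continue
        if target in _as_list(statement.get("Resource", [])):
            return True
    return False


# Hypothetical bucket policy used only to exercise the check.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/example-cross-account-role"},
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::my-root-bucket/unity-catalog/*",
        }
    ],
}

print(denies_unity_catalog_prefix(sample_policy, "my-root-bucket"))
```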

uc-by-default-aws-workspace-provisioning.png

System-owned groups and permissions

The system owned groups that are provisioned with the workspace work the same way they do on Azure. Just like on Azure, these groups do not appear in most surfaces in the Workspace UI, Account Console or APIs, and cannot be used to grant Unity Catalog privileges on other securables. The membership of these groups is kept in sync with all the users who have been pushed to the workspace with either the ADMIN or USER role using Identity Federation. These groups have enough permissions for the Workspace Administrators to manage the workspace catalog and for Workspace Users to start using UC in the default schema of the workspace catalog.

 

Group | Name | Unity Catalog Grants

Workspace Admin | _workspace_admins_${workspace_name}_${workspace_id} | OWNER on credential, external location and workspace catalog in addition to the metastore level rights listed in the next section

Workspace Users | _workspace_users_${workspace_name}_${workspace_id} | Usage (USE_CATALOG) rights on workspace catalog and usage rights on default schema (see below)

The following shows the grants on the default schema for the Workspace Users.

uc-by-default-aws-user-grants-default-schema.png
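To inspect these grants yourself, you can run SHOW GRANTS against the workspace catalog and its default schema. The Python sketch below only builds the SQL strings; the helper names are hypothetical, and in a Databricks notebook you would pass each statement to spark.sql or run it in the SQL editor.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier so names with hyphens or spaces are safe."""
    return "`" + name.replace("`", "``") + "`"


def show_grants_statements(workspace_catalog: str) -> list:
    """SQL to list grants on the workspace catalog and its default schema."""
    catalog = quote_ident(workspace_catalog)
    return [
        f"SHOW GRANTS ON CATALOG {catalog}",
        f"SHOW GRANTS ON SCHEMA {catalog}.{quote_ident('default')}",
    ]


statements = show_grants_statements("my-workspace")
for statement in statements:
    print(statement)
```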

Metastore-level grants for Auto-Enabled Workspace Administrators

The implementation of Unity Catalog Metastore grants to allow the Workspace Admins group to create other top level objects in Unity Catalog is the same as was shown for the Azure deployment. The screenshot also shows this workspace was deployed using the Account Console meaning the default catalog is the workspace catalog.

uc-by-default-aws-metastore-admin-grants.png

These grants do not include ownership of the metastore, meaning the workspace admin cannot delete metastore level UC securables that were created or are owned by other identities, including the workspace catalog and securables of other workspaces created with UC by default.

These grants also allow the Workspace Administrators to create other catalogs and related underlying securables like credentials and external locations. By default any securable created will be owned by the individual identity that created that securable and ownership allows transfer of ownership to a group.
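Because new securables are owned by the individual who created them, a common follow-up step is to hand ownership to a group. The Python sketch below builds the corresponding ALTER ... OWNER TO statements for the three securable types discussed in this article. The helper names and the group name are hypothetical, and the statements still need to be run in Databricks SQL (for example via spark.sql) by the current owner.

```python
# Securable types covered by this sketch.
_SUPPORTED_TYPES = {"CATALOG", "EXTERNAL LOCATION", "STORAGE CREDENTIAL"}


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier or principal name."""
    return "`" + name.replace("`", "``") + "`"


def transfer_ownership_sql(securable_type: str, name: str, new_owner: str) -> str:
    """Build an ALTER ... OWNER TO statement (hypothetical helper)."""
    kind = securable_type.upper()
    if kind not in _SUPPORTED_TYPES:
        raise ValueError(f"Unsupported securable type: {securable_type}")
    return f"ALTER {kind} {quote_ident(name)} OWNER TO {quote_ident(new_owner)}"


# 'data-platform-admins' is a made-up group name for illustration.
sql = transfer_ownership_sql("catalog", "sales_prod", "data-platform-admins")
print(sql)
```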

Best practices for using the workspace catalog

The guidance for using the workspace catalog is the same as that given for Azure: the workspace catalog is great for initial enablement, but it also ties your data to the lifecycle of the workspace. The recommendation remains to adhere to existing best practices for creating catalogs, aligning them with SDLC (Software Development Lifecycle), business units, and/or projects. This allows more flexibility to segregate storage away from the workspace and to bind these catalogs to multiple workspaces where required. It also means that the addition or removal of a workspace does not impact the lifecycle of any data stored in Unity Catalog.

The metastore permissions granted to the system owned Workspace Admins group are sufficient to create the securables (credentials, external locations, catalogs, etc.) needed to achieve the required catalog design for your organisation.
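As one way to picture such a design, the Python sketch below generates CREATE CATALOG statements for every combination of SDLC environment and business unit, each with a managed location outside the workspace root bucket. The naming scheme, environments, business units and bucket name are all assumptions for illustration, and each managed location path must already be covered by an external location in Unity Catalog.

```python
from itertools import product


def catalog_name(environment: str, business_unit: str) -> str:
    """Hypothetical naming scheme: <environment>_<business unit>."""
    return f"{environment}_{business_unit}"


def create_catalog_sql(environment: str, business_unit: str, storage_bucket: str) -> str:
    """Build a CREATE CATALOG statement with storage outside the workspace root."""
    name = catalog_name(environment, business_unit)
    return (
        f"CREATE CATALOG IF NOT EXISTS `{name}` "
        f"MANAGED LOCATION 's3://{storage_bucket}/{name}'"
    )


environments = ["dev", "prod"]
business_units = ["sales", "finance"]

# One catalog per environment and business unit, all in a dedicated data bucket.
statements = [
    create_catalog_sql(env, unit, "my-data-bucket")
    for env, unit in product(environments, business_units)
]
for statement in statements:
    print(statement)
```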

Conclusion

The experience of using Unity Catalog with an automatically provisioned workspace catalog on AWS should not look materially different to the Workspace Admins and Users than it would for an Azure Databricks workspace. The underlying cloud infrastructure is mostly abstracted away but at times it can be useful to understand these differences, especially for those planning the initial deployment of a workspace on either cloud. The end result in both is a way to start using Unity Catalog from the first time the workspace is deployed, allowing the workspace users to get all the benefits of the Databricks Data Intelligence Platform.

For further information and best practices on how to get the most out of UC, please follow the Databricks Unity Catalog SME page on Medium.