Administration & Architecture

Databricks AWS deployment with custom configurations (workspace root storage)

margarita_shir
New Contributor II

Hi everyone,

I have a question about the IAM role for workspace root storage when deploying Databricks on AWS with custom configurations (customer-managed VPC, storage configurations, credential configurations, etc.).

At an earlier stage of our deployment, I was following the manual setup documentation here:

https://docs.databricks.com/aws/en/admin/workspace/create-uc-workspace

Specifically, this step:

https://docs.databricks.com/aws/en/admin/workspace/create-uc-workspace#create-a-storage-configuratio...

This section describes creating a storage configuration for the workspace root S3 bucket and includes creating an IAM role that Databricks assumes to access this bucket.

However, when managing the same setup via Terraform, the equivalent resource, databricks_mws_storage_configurations (documented here:

https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-workspace#root-...), does not support specifying an IAM role at all, and the Terraform documentation omits creating or attaching a role for the root bucket entirely.
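For reference, here is roughly what that resource accepts (a minimal sketch; the provider alias, variable, and bucket references are placeholders, but the attribute set matches the provider docs):

resource "databricks_mws_storage_configurations" "root" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  storage_configuration_name = "workspace-root-storage"
  bucket_name                = aws_s3_bucket.root_storage.bucket
  # Only a bucket name is accepted; there is no argument for an
  # IAM role, which is exactly what prompted this question.
}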

This raised a few questions for me:

Was the IAM role originally intended for Unity Catalog storage within the root bucket, but has since been deprecated in favor of separate storage?

Initially, I thought it might be a good idea to explicitly specify an S3 bucket path in the metastore resource (so-called metastore-level storage). But after reading more documentation, I realized that Databricks best practices recommend assigning storage at the catalog level, managed through external locations and storage credentials, in an S3 bucket separate from the root bucket used for workspace assets (such as data, libraries, and logs). Hence we create managed catalogs by pointing them at an external location, and Databricks auto-generates the subpath (e.g., s3://databricks-unitycatalog/cps_business_insights/__unitystorage/catalogs/1234fda622-2cfb-478f-bbc4-b9cb84242baf).
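In Terraform terms, that catalog-level assignment looks roughly like this (a sketch; the bucket and catalog names are placeholders, and the path is assumed to sit under an already-registered external location):

resource "databricks_catalog" "cps_business_insights" {
  name = "cps_business_insights"
  # Must fall under an external location registered with a storage
  # credential; Databricks then auto-generates the
  # __unitystorage/catalogs/<uuid> subpath underneath it.
  storage_root = "s3://databricks-unitycatalog/cps_business_insights"
}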

Is the modern best practice therefore: a root S3 bucket (accessed via bucket policy only) that stores workspace assets (notebooks, cluster logs, libraries), plus a separate Unity Catalog metastore bucket (with its own IAM role)?

Can anyone clarify if this understanding is correct from a security best practices perspective?

Thanks in advance!

1 REPLY

MoJaMa
Databricks Employee

Great question.

[1] In the pre-UC world, when you created a workspace you would designate a bucket/container for what was most commonly known as DBFS, i.e., where the Hive metastore managed tables would be stored by default, along with other system assets such as job/cluster logs.

[2] When we moved to the UC world, the above stayed the same, and no UC data (i.e., managed tables) was stored there. For UC, the recommendation in the beginning was to bring another bucket/container for metastore root storage (for managed tables).

[3] Later we introduced catalog/schema location/storage (for managed tables) and made metastore root storage optional.

[4] We then launched a feature called "UC by default", which created a "default" metastore if you created a workspace in a region that didn't already have a UC metastore of your own. For this "UC by default" metastore, a "default workspace uc catalog" was created, and for its storage we piggy-backed on [1].

That's why, here: https://docs.databricks.com/aws/en/admin/workspace/create-uc-workspace#step-1-create-an-s3-bucket, the last deny statement prevents writes into that "default workspace uc catalog" storage area by anything other than the UC Master Role (which you create later in Step 2).
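For illustration, the shape of that kind of deny statement expressed in Terraform is roughly the following (a sketch only; the unity-catalog prefix, bucket name, account ID, and role name are placeholder assumptions, so take the exact statements from the linked doc):

resource "aws_s3_bucket_policy" "root_storage" {
  bucket = aws_s3_bucket.root_storage.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      # ...the allow statements from the linked doc go here...
      {
        # Deny any principal other than the UC Master Role access to
        # the default workspace UC catalog's storage area.
        Sid       = "DenyNonUcMasterRoleWrites"
        Effect    = "Deny"
        Principal = "*"
        Action    = "s3:*"
        Resource  = "arn:aws:s3:::my-root-bucket/unity-catalog/*"
        Condition = {
          StringNotLike = {
            "aws:PrincipalArn" = "arn:aws:iam::111122223333:role/my-uc-master-role"
          }
        }
      }
    ]
  })
}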

To summarize: if I were starting with Databricks today (and leaving aside the feature called serverless workspaces with default storage, which basically means you don't have to bring any storage at all at workspace creation time), the best practice would be to:

1. Create the workspace as shown in the doc you linked. This will end up creating a default workspace UC catalog for you. You don't have to use this catalog at all, but a default metastore is created in the backend at the same time if you didn't already have one in that region. (Think of the metastore as just an ID in a database; there's no role or storage specifically for it.)

2. Now create catalogs as per your design (by LoB, by SDLC, by LoB*SDLC, etc.). For each catalog, "bring" an IAM role that you register as a storage credential, and bring a bucket that you register as an external location governed by that role; use that external location as the catalog's "managed storage" location (see the Terraform sketch after this list).

3. Repeat for schemas (if you need data isolation at the schema level).
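A rough Terraform sketch of steps 2 and 3 (the resource names and the role/bucket references here are placeholder assumptions, not your actual design):

# Step 2a: register the IAM role you "bring" as a storage credential.
resource "databricks_storage_credential" "business_insights" {
  name = "cred-business-insights"
  aws_iam_role {
    role_arn = aws_iam_role.uc_business_insights.arn
  }
}

# Step 2b: register the bucket as an external location governed by
# that credential.
resource "databricks_external_location" "business_insights" {
  name            = "loc-business-insights"
  url             = "s3://my-business-insights-bucket/"
  credential_name = databricks_storage_credential.business_insights.name
}

# Step 2c: use the external location as the catalog's managed storage.
resource "databricks_catalog" "business_insights" {
  name         = "business_insights"
  storage_root = databricks_external_location.business_insights.url
}

# Step 3: repeat at the schema level if you need per-schema isolation.
resource "databricks_schema" "finance" {
  catalog_name = databricks_catalog.business_insights.name
  name         = "finance"
  storage_root = "s3://my-business-insights-bucket/finance"
}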

Just to reiterate: "Separate Unity Catalog metastore bucket (with its own IAM role)" --> don't do this.
A metastore doesn't need a role or storage.
Do that at the catalog and schema levels instead, i.e., steps 2 and 3 above.

Hope this helps.