09-19-2024 12:51 PM
Hello,
We are using Azure Databricks in a single tenant. We will have many teams working in multiple (Unity Enabled) Workspaces using a variety of Catalogs, External Locations, Storage Credentials, etc. Some of those resources will be shared (e.g., an External Location for a common Storage Account), and some will be specific to a team or Workspace. Our parent company controls the Admin Account, so most work will need to be done in the context of a Workspace (via Workspace Admin permissions).
While setting up our infrastructure (managed by Terraform), I realized I might have the wrong mental model. This diagram shows that Storage credential, External location, and Catalog are not directly associated with a Workspace. However, they must be created and are initially "connected" to a Workspace. It seems you can immediately "disconnect" a Catalog from a Workspace after creation, but this feels a little awkward.
What is considered best practice for organizing and maintaining these resources in a company with many business units via Terraform?
I suspect all of this has been asked and discussed before, but my googling skills failed me today. I would be happy to read a 10,000 word blog if anyone can point me in the right direction.
Thanks!
10-03-2024 02:52 PM
We are using Azure Databricks in a single tenant. We will have many teams working in multiple (Unity Enabled) Workspaces using a variety of Catalogs, External Locations, Storage Credentials, etc. Some of those resources will be shared (e.g., an External Location for a common Storage Account), and some will be specific to a team or Workspace. Our parent company controls the Admin Account, so most work will need to be done in the context of a Workspace (via Workspace Admin permissions).
While setting up our infrastructure (managed by Terraform), I realized I might have the wrong mental model. This diagram shows that Storage credential, External location, and Catalog are not directly associated with a Workspace. However, they must be created and are initially "connected" to a Workspace. It seems you can immediately "disconnect" a Catalog from a Workspace after creation, but this feels a little awkward.
Most catalog operations, such as creating catalogs and external locations or adding storage credentials, are usually done by Databricks users and data stewards rather than by admins. Because of that, these operations are always done within a workspace, even though they may affect other workspaces. Since most Databricks users only have access to one or more workspaces and not to the account console, it makes sense to do these operations within the workspace.
I also want to clarify that when you create a catalog, external location, or storage credential, it can be tied to a specific workspace or shared with other workspaces. I understand that it's a little awkward to immediately disconnect a catalog from other workspaces after creation, since that adds an extra step. I will share that feedback with the Unity Catalog team.
What is considered best practice for organizing and maintaining these resources in a company with many business units and teams via Terraform?
In your situation, where you have many different business units and you don't want IT logging into any one specific business unit's workspace, it would make sense to have an "Administrative Workspace". This allows IT to perform operations that apply to other business units, such as delegating metastore permissions, adding catalogs, etc. This is particularly useful if the admins are helping design and manage the metastore.
It is quite common to have catalogs that are a combination of business unit and environment (prod/dev/stage). Then, within each business unit, each team may have its own schema(s). Since catalogs can be assigned to specific workspaces, you would use catalogs to enforce logical/physical separation of data (e.g., ensuring that users can't access production data from a dev workspace and that it is impossible to commingle data from business unit A with data from business unit B).
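As a rough Terraform sketch of that layout (not a definitive implementation), the following creates one catalog per business unit and environment and binds each catalog only to the workspace for its environment. The business unit names, environment names, and the `workspace_ids` variable are hypothetical placeholders. It also assumes the metastore has a root storage location; if it does not, each catalog would need its own `storage_root`.

```hcl
# Hypothetical input: map of environment name to workspace ID,
# e.g. { dev = "1111111111111111", prod = "2222222222222222" }
variable "workspace_ids" {
  type = map(string)
}

locals {
  # Placeholder names; substitute your own business units and environments.
  business_units = ["sales", "finance"]
  environments   = ["dev", "prod"]

  # One entry per (business unit, environment) pair, keyed like "sales_dev".
  catalogs = {
    for pair in setproduct(local.business_units, local.environments) :
    "${pair[0]}_${pair[1]}" => { bu = pair[0], env = pair[1] }
  }
}

resource "databricks_catalog" "this" {
  for_each = local.catalogs

  name    = each.key
  comment = "Catalog for ${each.value.bu} (${each.value.env})"

  # ISOLATED restricts the catalog to explicitly bound workspaces,
  # instead of exposing it to every workspace on the metastore.
  isolation_mode = "ISOLATED"
}

resource "databricks_workspace_binding" "this" {
  for_each = local.catalogs

  securable_name = databricks_catalog.this[each.key].name
  securable_type = "catalog"

  # Bind each catalog only to the workspace of its own environment.
  workspace_id = var.workspace_ids[each.value.env]
}
```

Setting `isolation_mode` at creation time keeps the "disconnect after creation" step inside the same Terraform apply, so it does not have to be done by hand.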
Regarding collision of resources: if you create separate catalogs for different business units and dev/stage/prod, you will need to put them in different blob containers or in separate subdirectories within those containers. So there needs to be some separation at the Azure storage level. If certain data is located in a specific storage container and is shared across multiple groups/teams, that data cannot be registered in multiple different catalogs. It needs to be registered once, and you should use Unity Catalog to manage the use of that shared resource. Any shared data should exist within the same catalog/schema location.
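A minimal sketch of that separation in Terraform might look like the following: each catalog gets its own container through a dedicated external location, and the shared data is registered once in a single catalog that several groups are granted access to. All names here (the variables, container names, catalog names, and group names) are hypothetical, and the privilege list is only an example.

```hcl
# Hypothetical inputs: the storage account name and the resource ID
# of an Azure Databricks access connector that can reach it.
variable "storage_account" {
  type = string
}

variable "access_connector_id" {
  type = string
}

resource "databricks_storage_credential" "lake" {
  name = "lake_credential"
  azure_managed_identity {
    access_connector_id = var.access_connector_id
  }
}

# One container per catalog, so storage paths never overlap.
resource "databricks_external_location" "hr_prod" {
  name            = "hr_prod"
  url             = "abfss://hr-prod@${var.storage_account}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.lake.name
}

resource "databricks_external_location" "shared" {
  name            = "shared"
  url             = "abfss://shared@${var.storage_account}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.lake.name
}

resource "databricks_catalog" "hr_prod" {
  name         = "hr_prod"
  storage_root = databricks_external_location.hr_prod.url
}

# Shared data is registered exactly once, in its own catalog.
resource "databricks_catalog" "shared" {
  name         = "shared"
  storage_root = databricks_external_location.shared.url
}

# Access to the shared catalog is managed with grants rather than
# by registering the same storage path in several catalogs.
resource "databricks_grants" "shared" {
  catalog = databricks_catalog.shared.name

  grant {
    principal  = "hr-analysts" # hypothetical group
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }

  grant {
    principal  = "finance-analysts" # hypothetical group
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}
```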