How to segregate dev/staging/prod data and resources with Unity Catalog
01-18-2023 04:16 AM
We are planning to migrate to Unity Catalog, but we are unable to determine how to segregate dev, staging and production data from each other.
Our plan was to separate catalogs by SDLC environment scope (as per the description and diagram at https://docs.databricks.com/data-governance/unity-catalog/best-practices.html).
We would then have a catalog for each environment. We would also have a workspace in each of dev/staging/prod that would do the bulk of our processing. Each of these workspaces and their related resources (VMs, VNets, storage, etc.) would be in a separate Azure subscription, cleanly segregating dev, staging and prod resources from each other.
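For reference, this is roughly the catalog-per-environment setup we had in mind; a minimal sketch assuming a notebook attached to a Unity Catalog-enabled cluster, with illustrative catalog and storage names, and assuming an external location/storage credential already covers each path:

```python
# Sketch only: catalog names and storage container names are illustrative.
for env in ["dev", "staging", "prod"]:
    # One catalog per SDLC environment, each with its own managed location so
    # the underlying managed table files are also kept apart per environment.
    spark.sql(f"""
        CREATE CATALOG IF NOT EXISTS {env}
        MANAGED LOCATION 'abfss://{env}@examplestorage.dfs.core.windows.net/'
    """)
```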
We have a policy that data should never cross dev/staging/prod boundaries, e.g. prod data should never be stored or processed by staging or dev resources. This is a reasonable policy aimed at reducing the chance of sensitive prod data ending up where it shouldn't, and at preventing inaccurate staging/dev data from accidentally influencing production.
However, Unity Catalog seems to make all dev/staging/prod data accessible to all workspaces. We can restrict access via user permissions, but there are cases where a user may have access to multiple catalogs. What we really need is to restrict catalogs by workspace, but that doesn't seem to be an option. Alternatively, if we could have multiple metastores in a region we could segregate that way, but that also seems to be prevented.
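To illustrate the permission-based restriction we can do today, a minimal sketch (group names are illustrative; assumes it is run by a metastore admin or the catalog owner):

```python
# Sketch only: environment-to-group mapping is illustrative.
env_groups = {"dev": "dev-engineers", "staging": "staging-engineers", "prod": "prod-engineers"}
for catalog, group in env_groups.items():
    # Grants read access on the whole catalog to the environment's group, but
    # nothing stops a user who belongs to two groups from reading both catalogs.
    spark.sql(f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG `{catalog}` TO `{group}`")
```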
Is there any setup or feature we can use that would segregate data from dev, staging and prod such that data from one environment can't be processed by resources in another?
- Labels:
  - SDLC
  - SDLC Environment
  - Unity Catalog
04-10-2023 08:07 AM
@Alex Davies:
Unity Catalog does not currently support separating data by workspace or Azure subscription. As you noted, data from all catalogs within a region can be accessed by any workspace within that region, and it is up to user permissions to restrict access appropriately.
One potential workaround for your use case could be to use separate Databricks accounts (with separate regions) for each environment (dev, staging, prod). Each Databricks account would have its own Unity Catalog and associated workspaces, and data would be segregated by account. This would allow you to enforce your policy of not allowing data to cross environment boundaries, but would also require managing separate Databricks accounts and associated resources for each environment.
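As a rough illustration of how automation could stay environment-scoped under that setup, here is a minimal sketch using the Databricks SDK for Python (assumes one ~/.databrickscfg profile per environment; the profile names are illustrative):

```python
from databricks.sdk import WorkspaceClient

def client_for(env: str) -> WorkspaceClient:
    # Each profile holds the host and credentials for that environment's
    # workspace, so a "dev" client can only reach the dev account's metastore.
    return WorkspaceClient(profile=env)

dev = client_for("dev")
for catalog in dev.catalogs.list():
    print(catalog.name)  # only catalogs registered in the dev metastore
```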
Another option would be to use separate virtual networks (VNets) for each environment, and configure VNet peering between the VNets as appropriate to allow necessary communication between environments. This would allow you to further restrict network traffic between environments, but would require additional configuration and management of VNets and peering relationships.
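If you go down the VNet route, peering is configured on the Azure side rather than in Databricks. A minimal sketch with the azure-mgmt-network package (all subscription IDs, resource groups and resource names are illustrative, and a matching peering must also be created from the other VNet's side):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

credential = DefaultAzureCredential()
staging_network = NetworkManagementClient(credential, subscription_id="<staging-subscription-id>")

# One-way peering from the staging VNet to the prod VNet; forwarded traffic is
# left disabled so only direct, intended traffic can cross the boundary.
poller = staging_network.virtual_network_peerings.begin_create_or_update(
    resource_group_name="rg-staging-network",
    virtual_network_name="vnet-staging",
    virtual_network_peering_name="staging-to-prod",
    virtual_network_peering_parameters={
        "remote_virtual_network": {
            "id": (
                "/subscriptions/<prod-subscription-id>/resourceGroups/rg-prod-network"
                "/providers/Microsoft.Network/virtualNetworks/vnet-prod"
            )
        },
        "allow_virtual_network_access": True,
        "allow_forwarded_traffic": False,
    },
)
peering = poller.result()
```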
Ultimately, the best solution will depend on the specific needs and constraints of your organization. It may be worth discussing with a Databricks solutions architect to determine the best approach for your use case.