How to segregate dev/staging/prod data and resources with Unity Catalog

AlexDavies
Contributor

We are planning to migrate to Unity Catalog but have been unable to determine how we can segregate dev, staging, and production data from each other.

Our plan was to separate catalogs by SDLC environment scope (as per the description and diagram at https://docs.databricks.com/data-governance/unity-catalog/best-practices.html).

We would then have a catalog for each environment. We would also have a workspace in each of dev/staging/prod that would do the bulk of our processing. Each of these workspaces and their related resources (VMs, VNets, storage, etc.) would be in separate Azure subscriptions, cleanly segregating dev, staging, and prod resources from each other.
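For reference, a rough sketch of the catalog-per-environment setup we had in mind (run from a notebook in each workspace; the catalog names and storage paths are illustrative, and it assumes external locations for the per-environment storage accounts have already been configured):

```python
# Rough sketch only: one catalog per SDLC environment, each backed by storage in that
# environment's own Azure subscription. Names and paths below are illustrative.
for env in ["dev", "staging", "prod"]:
    spark.sql(
        f"""
        CREATE CATALOG IF NOT EXISTS {env}
        MANAGED LOCATION 'abfss://unity@{env}storageaccount.dfs.core.windows.net/managed'
        COMMENT '{env} environment catalog'
        """
    )
```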

We have a policy that data should never cross dev/staging/prod boundaries: for example, prod data should never be stored or processed by staging or dev resources. This is a reasonable policy aimed at reducing the chances of sensitive prod data ending up where it shouldn't and preventing inaccurate staging/dev data from accidentally influencing production.

However, Unity Catalog seems to make all dev/staging/prod data accessible to all workspaces. We can restrict access via user permissions, but there are cases where a user may have access to multiple catalogs. What we really need is to restrict catalogs by workspace, but that doesn't seem to be an option. Alternatively, if we could have multiple metastores in a region we could segregate that way, but that also seems to be prevented.

Is there any setup or feature we can use that would segregate data from dev, staging, and prod such that data from one environment can't be processed by resources in another?

1 REPLY

Anonymous
Not applicable

@Alex Davies:

Unity Catalog does not currently support separating data by workspace or Azure subscription. As you noted, data from all catalogs within a region can be accessed by any workspace within that region, and it is up to user permissions to restrict access appropriately.
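As a rough sketch of that permissions-based approach (group names are illustrative, and it assumes environment-specific groups already exist), you could keep each catalog locked down to its own environment group:

```python
# Illustrative only: restrict each environment's catalog to a matching group.
# Assumes groups like dev_engineers / staging_engineers / prod_engineers exist.
for env in ["dev", "staging", "prod"]:
    # Remove any broad grants first (e.g. ones given to all account users).
    spark.sql(f"REVOKE ALL PRIVILEGES ON CATALOG {env} FROM `account users`")
    # Then grant access only to the group for that environment.
    spark.sql(f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG {env} TO `{env}_engineers`")
```

As you say, this still relies on no individual being in more than one environment's group, so it controls who can read the data rather than which workspace can reach it.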

One potential workaround for your use case could be to use separate Databricks accounts (with separate regions) for each environment (dev, staging, prod). Each Databricks account would have its own Unity Catalog and associated workspaces, and data would be segregated by account. This would allow you to enforce your policy of not allowing data to cross environment boundaries, but would also require managing separate Databricks accounts and associated resources for each environment.
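To illustrate the separate-accounts idea: each environment would have its own workspace URL, metastore, and credentials, so a client pointed at one environment simply never sees the others' catalogs. A minimal sketch using the Databricks SDK for Python (the hosts and tokens below are placeholders, not real values):

```python
from databricks.sdk import WorkspaceClient

# Placeholder hosts/tokens for workspaces living in separate Databricks accounts.
clients = {
    "dev": WorkspaceClient(host="https://adb-dev-1111.azuredatabricks.net", token="<dev-token>"),
    "prod": WorkspaceClient(host="https://adb-prod-2222.azuredatabricks.net", token="<prod-token>"),
}

# Each account has its own metastore, so each client can only list its own catalogs.
for env, w in clients.items():
    print(env, [c.name for c in w.catalogs.list()])
```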

Another option would be to use separate virtual networks (VNets) for each environment, and configure VNet peering between the VNets as appropriate to allow necessary communication between environments. This would allow you to further restrict network traffic between environments, but would require additional configuration and management of VNets and peering relationships.
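On the network side, leaving the dev/staging/prod VNets un-peered keeps them isolated from each other by default; you would only add a peering where a cross-environment connection is explicitly required. A rough sketch with the Azure SDK for Python (subscription IDs, resource names, and the hub VNet are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SubResource, VirtualNetworkPeering

# Placeholder subscription and resource names; adjust to your environments.
client = NetworkManagementClient(DefaultAzureCredential(), "<prod-subscription-id>")

# Only create a peering where cross-environment traffic is explicitly required,
# e.g. from the prod VNet to a shared "hub" VNet; dev/staging stay un-peered.
client.virtual_network_peerings.begin_create_or_update(
    resource_group_name="rg-prod-network",
    virtual_network_name="vnet-prod",
    virtual_network_peering_name="prod-to-hub",
    virtual_network_peering_parameters=VirtualNetworkPeering(
        remote_virtual_network=SubResource(id="<hub-vnet-resource-id>"),
        allow_virtual_network_access=True,
        allow_forwarded_traffic=False,
    ),
).result()
```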

Ultimately, the best solution will depend on the specific needs and constraints of your organization. It may be worth discussing with a Databricks solutions architect to determine the best approach for your use case.
