
unity catalog guidelines for internal/external tables and multiple workspaces

databriccone
New Contributor

We're setting up Unity Catalog from scratch in our Azure infrastructure, which is both
- multi-region (Europe, US)
- multi-environment (dev, qa, prod)

So we set up two metastores, one per region: one in West Europe and one in South Central US.
So far, so good.

Now I'm unsure how to integrate our actual data with it.
In the setup we had before Unity Catalog was released, we had:
- separate ADLS storage accounts per region and environment (one ADLS account for DEV, one for QA, and so on)
- separate Databricks workspaces (one for DEV, one for QA, and so on)

My first approach would be to register all of these existing ADLS accounts as external locations.
I would then only register catalogs, schemas, tables and volumes in the metastores, which would hold metadata only, while the actual data lives elsewhere.
Aside from the fact that Databricks would not "own" the data this way, would I still have all Unity Catalog features available, as with managed (internal) tables?
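Concretely, the kind of registration I have in mind looks roughly like this, run from a notebook where spark is the SparkSession Databricks provides. The storage account, credential, catalog and table names are just placeholders, and I'm assuming a storage credential already exists:

    # Placeholder names throughout; assumes a storage credential named `dev_credential`
    # has already been created for the DEV ADLS account.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS dev_data
        URL 'abfss://data@mydevstorage.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL dev_credential)
    """)

    # Only metadata lands in the metastore; the table files stay in our own ADLS.
    spark.sql("CREATE CATALOG IF NOT EXISTS sales_dev")
    spark.sql("CREATE SCHEMA IF NOT EXISTS sales_dev.core")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_dev.core.orders (
            order_id BIGINT,
            amount DECIMAL(18, 2)
        )
        LOCATION 'abfss://data@mydevstorage.dfs.core.windows.net/core/orders'
    """)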

If instead I wanted to use managed (internal) storage, which is bound to the metastore's backing ADLS account, I assume I'd have to put DEV, QA and PROD data together in that same storage account.

So here are the questions:

- What is the suggested approach to naming conventions? Is it simply adding a DEV/QA/PROD suffix to catalogs and schemas to distinguish them?
- How about granting access to the different DEV, QA and PROD catalogs and schemas from different workspaces?
Is there a way to grant access at the workspace level, or do I need to create users and groups at the metastore level?
I assume in this case every workspace should have its own credentials, and PROD should presumably be accessible only to highly privileged users and to the service principals that run PROD workloads and pipelines.
 
- What are the performance implications? With managed tables we'd have DEV, QA and PROD data all together, with possibly different retention times and different workload sizes.
DEV and PROD workloads would share the same ADLS account, albeit in different containers.
I still see that as a potential source of bottlenecks: keeping the data in separate ADLS accounts makes me more comfortable performance-wise. Am I worrying too much?
1 REPLY

Kaniz
Community Manager

Hi @databriccone, setting up Unity Catalog in your Azure infrastructure is a crucial step toward better data governance and operational efficiency. Let's address your questions:

  1. Naming Conventions:

    • For naming conventions, consider adding an environment suffix such as "DEV," "QA," or "PROD" to your catalogs (and, where useful, schemas) so the environments are easy to tell apart.
    • Example: for a "Sales" data domain you would create separate catalogs named "Sales_DEV," "Sales_QA," and "Sales_PROD" (see the first sketch after this list).
  2. Access Control and Permissions:

    • Unity Catalog provides fine-grained access controls. You can grant access at different levels:
      • Workspace Level: you can bind a catalog to specific workspaces, so that, for example, the PROD catalog is only reachable from the PROD workspace. Users, groups, and service principals are managed at the Databricks account level and assigned to the workspaces that need them.
      • Metastore Level: you can also manage access purely at the metastore level, but this is less common because grants then apply across every workspace attached to that metastore.
    • Consider the following best practices (a GRANT sketch follows after this list):
      • Workspace-Specific Credentials: each workspace should use its own credentials; DEV, QA, and PROD workspaces should not share service principals or storage credentials.
      • Highly Privileged Access: limit access to PROD data to highly privileged users and the service principals that run PROD workloads and pipelines. This keeps security and compliance tight.
      • Delta Sharing: use Delta Sharing for data sharing across regions. It allows controlled access without exposing direct storage paths.
  3. Performance Implications:

    • When using managed (internal) tables, stored in the metastore's backing ADLS account:
      • Data Coexistence: DEV, QA, and PROD data coexist in the same storage account, so make sure they are properly segregated into separate containers or managed locations.
      • Retention Times: manage retention separately for each environment, for example via Delta table properties (see the second sketch after this list).
      • Workload Sizes: different environments will have very different data volumes; monitor and optimize storage per environment.
      • Resource Utilization: pair Unity Catalog with standardized cluster policies per environment to prevent resource misuse.
      • Cost Control: tag clusters and jobs per environment so chargeback, utilization, and costs can be tracked accurately.
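Here is a rough sketch of how the environment-suffixed catalogs and the corresponding grants could look from a notebook (spark is the SparkSession Databricks provides; the catalog and group names are placeholders rather than required conventions):

    # Placeholder catalog and group names; adapt to your own naming scheme.
    for env in ("dev", "qa", "prod"):
        spark.sql(f"CREATE CATALOG IF NOT EXISTS sales_{env}")

    # DEV is open to the engineering group, QA is usable but locked down for writes,
    # and PROD is restricted to a dedicated high-privilege group.
    spark.sql("GRANT USE CATALOG, CREATE SCHEMA ON CATALOG sales_dev TO `data_engineers`")
    spark.sql("GRANT USE CATALOG ON CATALOG sales_qa TO `data_engineers`")
    spark.sql("GRANT USE CATALOG, SELECT ON CATALOG sales_prod TO `prod_pipeline_admins`")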
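For per-environment retention, one option is Delta table properties on the tables in each catalog; the table names follow the placeholder convention above and the intervals are illustrative only:

    # Illustrative values: shorter Delta history in DEV than in PROD.
    spark.sql("""
        ALTER TABLE sales_dev.core.orders SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 7 days',
            'delta.deletedFileRetentionDuration' = 'interval 7 days'
        )
    """)
    spark.sql("""
        ALTER TABLE sales_prod.core.orders SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 90 days',
            'delta.deletedFileRetentionDuration' = 'interval 30 days'
        )
    """)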

Remember that Unity Catalog simplifies data management and governance, providing a central place for administering and auditing data access. By following best practices, you can achieve a seamless transition and maximize the benefits of Unity Catalog. 🚀

 