We're setting up Unity Catalog from scratch in our Azure infrastructure, which is both
- multi-region (Europe, US)
- multi-environment (dev, qa, prod)
So we set up two metastores, one per region: one in West Europe and one in South Central US.
So far, so good.
Now I'm unsure how to integrate real data with it.
In a separate product that we built before Unity Catalog was available, we had:
- separate ADLS storage accounts per region and environment (so one ADLS account for DEV, one for QA, and so on).
- separate Databricks workspaces (one for DEV, one for QA, and so on).
So my first approach would be to bind all these brand-new ADLS accounts as external locations.
That way I'd only have to register catalogs, schemas, tables and volumes in the metastores, which would contain only the metadata, while the real data would live elsewhere.
Apart from the fact that Databricks would not "own" the data this way, would I have all the Unity Catalog features available that I'd get with internal tables?
If I instead wanted to use the internal storage, which is bound to the metastore-backed ADLS, I assume I'd have to keep DEV, QA and PROD data in the same store.
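
To make the first approach (external locations) concrete, here is a minimal sketch of what I have in mind: a small Python script that only builds the `CREATE EXTERNAL LOCATION` statements, one per region/environment pair. The region codes, storage account names, the `data` container and the storage credential names are all hypothetical placeholders, not our real setup, and the statements for each region would have to be run against that region's metastore.

```python
from itertools import product

# Hypothetical short codes for West Europe and South Central US.
REGIONS = ["weu", "scus"]
ENVS = ["dev", "qa", "prod"]


def external_location_sql(region: str, env: str) -> str:
    """Build one CREATE EXTERNAL LOCATION statement for a region/env pair."""
    # Assumed naming scheme: one storage account per region and environment,
    # each with a container called "data" (both names are placeholders).
    account = f"adls{region}{env}"
    url = f"abfss://data@{account}.dfs.core.windows.net/"
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {region}_{env}_data "
        f"URL '{url}' "
        # The storage credential is assumed to exist already in the metastore.
        f"WITH (STORAGE CREDENTIAL {region}_{env}_cred)"
    )


# One statement per (region, env) pair; each regional metastore would only
# receive the statements for its own region.
statements = [external_location_sql(r, e) for r, e in product(REGIONS, ENVS)]

for stmt in statements:
    print(stmt)
```

The script only prints SQL text; nothing is executed against Databricks.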
So here are the questions:
- What is the suggested naming convention? Is it a matter of adding a "DEV", "QA" or "PROD" suffix to catalogs/schemas to distinguish them?
- How should access to the different DEV, QA and PROD catalogs and schemas be granted for the different workspaces?
Is there a way to grant access at the workspace level, or do I need to create users and groups at the metastore level?
I assume in this case every workspace should have different credentials, and possibly PROD should be accessible only to highly privileged users and to the service principals that run PROD workloads and pipelines.
- What are the performance implications? With internal tables we'd have DEV, QA and PROD data all together, with possibly different retention times and different workload sizes.
DEV and PROD workloads would still use the same ADLS account, albeit in different containers of course.
Either way, I see this as a problem and a potential source of bottlenecks: having the data in different ADLS accounts makes me more comfortable, performance-wise. Am I worrying too much?
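
For the first two questions, this is roughly the layout I'm imagining, sketched as a Python script that only generates catalog names and `GRANT` statements. The product name `sales` and the group names `data_engineers` and `prod_operators` are made up for illustration, and I'm assuming the groups already exist at the account level. I'm not sure this is the recommended pattern, which is exactly what I'm asking.

```python
ENVS = ["dev", "qa", "prod"]

# Hypothetical groups: PROD is restricted to a more privileged group.
READERS = {
    "dev": "data_engineers",
    "qa": "data_engineers",
    "prod": "prod_operators",
}


def catalog_name(product_name: str, env: str) -> str:
    """Environment suffix on the catalog name, e.g. sales_dev."""
    return f"{product_name}_{env}"


def grant_sql(product_name: str, env: str) -> list[str]:
    """Build read-only GRANT statements for one environment's catalog."""
    catalog = catalog_name(product_name, env)
    group = READERS[env]
    # Granted at catalog level so that schemas and tables below inherit them.
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON CATALOG {catalog} TO `{group}`",
        f"GRANT SELECT ON CATALOG {catalog} TO `{group}`",
    ]


for env in ENVS:
    for stmt in grant_sql("sales", env):
        print(stmt)
```

With this convention each environment gets its own catalog (`sales_dev`, `sales_qa`, `sales_prod`) in the regional metastore, and access is controlled per catalog rather than per workspace.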