Databricks Community

pernilak · ‎03-07-2024

When setting up Unity Catalog, Databricks recommends deciding on your data isolation model, i.e. how to physically separate your data into different storage accounts and/or containers. There are so many options that it can be hard to be confident in the solution you choose. Some alternatives we are looking into are:

Should all catalogs and the metastore reside in the same storage account (but different containers)?
Should the metastore have one storage account and other catalogs reside in a different one (separate containers)?
Should dev, test and prod catalogs be in different storage accounts?
Should one domain (we have catalogs based on domain) be in one storage account, but then have dev, test and prod catalogs in different containers?
Should data be separated based on the requirements for retention and backup?
Or should we separate data on schemas (different containers or storage accounts?)?
Should some schemas not reside in the same storage account as the catalog?

What are your thoughts on this subject? What are the pros and cons of the different methods, based on your experience?

Wojciech_BUK · ‎03-07-2024

Hello,

I think there is no simple answer and it all depends on the use case. I can try to give you some hints that I follow:

1) Should the metastore have one storage account and other catalogs reside in a different one (separate containers)?
Avoid metastore-level central storage. It is no longer required and it creates an architectural mess. Focus on assigning a default storage location at least on each catalog. Multiple catalogs can share the same storage.
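
A minimal sketch of what "default storage location on each catalog" can look like, assuming Azure ADLS Gen2 paths and that the path is already covered by an external location you are allowed to use. The helper only builds the SQL text; the catalog, container and storage account names are made up for illustration.

```python
def create_catalog_sql(catalog: str, container: str, storage_account: str, path: str = "") -> str:
    """Build a CREATE CATALOG statement that gives the catalog its own managed location."""
    # abfss://<container>@<account>.dfs.core.windows.net/<path> is the ADLS Gen2 URI form.
    location = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}".rstrip("/")
    return f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{location}'"


# Hypothetical names: one container per environment in a shared storage account.
for env in ("dev", "test", "prod"):
    print(create_catalog_sql(f"sales_{env}", f"sales-{env}", "stsalesdata"))
```

In a notebook you could pass each statement to spark.sql(...). With this in place, managed tables in the catalog land in the catalog's own location instead of a metastore-level root.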

2) Look at possible storage account limits. If you have a really big system and try to put all data in one storage account, you can hit request limits and throttling, e.g. your jobs or queries can get stuck on those limits.
Make sure you distribute the workload across several storage accounts. There is no additional "fee" for having multiple storage accounts ... but ...

3) If you plan to use private endpoints, don't create too many storage accounts; separate on containers instead. A private endpoint costs ~8 USD each month, and if you create too many storage accounts you will suddenly pay a lot for idle private endpoints.
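
To make the arithmetic concrete, here is a rough estimate, assuming the ~8 USD per endpoint per month mentioned above (check current pricing for your region) and ignoring data processing charges. The number of endpoints per storage account is also an assumption; you may need more than one per account, for example one for the blob sub-resource and one for dfs.

```python
def monthly_private_endpoint_cost(
    storage_accounts: int,
    endpoints_per_account: int = 1,
    usd_per_endpoint: float = 8.0,  # assumed flat monthly price per endpoint
) -> float:
    """Estimate the fixed monthly cost of private endpoints across storage accounts."""
    return storage_accounts * endpoints_per_account * usd_per_endpoint


# 20 storage accounts vs. 2 storage accounts with many containers,
# each account with two endpoints (assumed: blob + dfs).
many_accounts = monthly_private_endpoint_cost(20, endpoints_per_account=2)
few_accounts = monthly_private_endpoint_cost(2, endpoints_per_account=2)
print(many_accounts, few_accounts)
```

The difference is paid every month whether or not anyone reads the data, which is why containers are the cheaper separator here.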

4) Make it easy to manage. I find some architectural concepts easier to manage than others. E.g. for data archiving I make a table clone. The clone always lands in a catalog with the suffix _archive. Those catalogs have separate storage, where I set a storage policy to move data to the Cool and/or Archive tier. I apply this policy to the entire storage account. Just try to make it easy for yourself.
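
The archive convention can be sketched as a small helper that derives the target name from the source name. This assumes three-level table names and a deep clone (so the data files are actually copied into the archive storage); the table name in the example is hypothetical.

```python
def archive_clone_sql(table: str, suffix: str = "_archive") -> str:
    """Build a DEEP CLONE statement targeting the matching <catalog>_archive catalog."""
    # Expects catalog.schema.table; raises ValueError for any other shape.
    catalog, schema, name = table.split(".")
    target = f"{catalog}{suffix}.{schema}.{name}"
    return f"CREATE OR REPLACE TABLE {target} DEEP CLONE {table}"


print(archive_clone_sql("sales.orders.invoices"))
```

Because every archive catalog sits on its own storage, the tiering policy can be applied to the whole storage account without per-table rules.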

5) External Location - this can be your only separator for Env / Department when you don't have any strict security requirements.

6) Cost management - imagine you have multiple divisions. If each division needs to be cross-charged for data (read, write and storage), I find it super easy to create separate storage for each division and charge them for any cost associated with that storage.
If you don't do this, the calculation is really hard to make, e.g. you would have to calculate the data file sizes of each table.

7) Environment separation - separate your environments. For a small project without restrictions, I would separate on container level. For bigger projects with more restrictions, I would separate on storage account level (then I put the storage accounts in separate subscriptions and VNets).
Remember that if you create something like 100 storage accounts and 10 Databricks workspaces, you might get an administration headache allowing cluster subnets to reach your storage accounts, and that adds an additional layer of work when divisions want to share data with each other.

8) Regionalization requirement - this basically means you have to create a separate workspace and storage in the dedicated region (maybe even a separate metastore) and map the relevant catalog / schema to this storage.

9) Schema level - I try to design my metastore(s) so that I don't put schemas in different storage accounts. I still assign a separate default location to each schema, using the /<container>/<schema_name>/ storage path.
But this is because I separate e.g. divisions on catalog level. If you decided to separate divisions on schema level, it would be OK to separate storage on schema level.
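
A small sketch of the per-schema default location, assuming ADLS Gen2 and that each schema gets a folder named after it inside the catalog's container. All names are hypothetical, and the path must be covered by an external location.

```python
def create_schema_sql(catalog: str, schema: str, container: str, storage_account: str) -> str:
    """Build a CREATE SCHEMA statement whose managed location is /<container>/<schema_name>."""
    location = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{schema}"
    return f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema} MANAGED LOCATION '{location}'"


# Same storage account and container for the whole catalog, one folder per schema.
for schema in ("bronze", "silver", "gold"):
    print(create_schema_sql("sales_dev", schema, "sales-dev", "stsalesdata"))
```

If you separate divisions on schema level instead, the same helper works with a different container or storage account per schema.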

