Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Pros and cons of physically separating data in different storage accounts and containers

pernilak
New Contributor III

When setting up Unity Catalog, Databricks recommends deciding on a data isolation model, that is, how to physically separate your data into different storage accounts and/or containers. There are so many options that it can be hard to be confident in the solution you choose. Some alternatives we are looking into are:

 

  • Should all catalogs and the metastore reside in the same storage account (but different containers)?
  • Should the metastore have one storage account and other catalogs reside in a different one (separate containers)?
  • Should dev, test and prod catalogs be in different storage accounts?
  • Should one domain (we have catalogs based on domain) be in one storage account, but then have dev, test and prod catalogs in different containers?
  • Should data be separated based on the requirements for retention and backup?
  • Or should we separate data on schemas (different containers or storage accounts?)?
  • Should some schemas not reside in the same storage account as the catalog?

What are your thoughts on this subject? What are the pros and cons of the different approaches, based on your experience?

1 ACCEPTED SOLUTION

Accepted Solutions

Wojciech_BUK
Contributor III

Hello,

I think there is no simple answer and it all depends on the use case, but I can give you some hints that I follow:

1) Should the metastore have one storage account and other catalogs reside in a different one (separate containers)?
Avoid metastore-level central storage. It is no longer required, and it creates an architectural mess. Focus on assigning a default storage location at least on each catalog. Multiple catalogs can share the same storage.

2) Look at possible storage account limits. If you have a really big system and you try to put all data in one storage account, you can hit request limits and throttling, and your jobs or queries can get stuck on those limits.
Make sure you distribute the workload across several storage accounts. There is no additional "fee" for having multiple storage accounts ... but ...


3) If you plan to use private endpoints, don't create too many storage accounts; separate on containers instead. A private endpoint costs you ~8 USD each month, and if you create too many storage accounts, you will suddenly pay a lot for idle private endpoints.

4) Make it easy to manage - I find some architectural concepts easier to manage than others. E.g. for data archiving I make a table clone. The clone always lands in a catalog with the suffix _archive. Those catalogs have separate storage, where I put a storage policy to move data to the Cool and/or Archive tier. I apply this policy to the entire storage account. Just try to make it easy for yourself.

5) External Location - this can be your only separator for Env / Department when you don't have any strict security requirements.  

6) Cost management - imagine you have multiple divisions. If each division needs to be cross-charged for data (read, write and storage), I find it super easy to create separate storage for each division and charge them for any cost associated with that storage.
If you don't do this, the calculation is really hard, e.g. you would have to calculate the data file sizes of each table.

7) Environment separation - separate your environments. For a small project without restrictions, I would separate on container level. For bigger projects with more restrictions, separate on storage account level (then I put the storage accounts in separate subscriptions and VNETs).
Remember that if you create something like 100 storage accounts and 10 Databricks workspaces, you might have an administration headache allowing the cluster subnets to reach your storage accounts, and that creates an additional layer of work when divisions would like to share data with each other.

8) Regionalization requirement - this basically means you have to create a separate workspace and storage in a dedicated region (maybe even a separate metastore) and map the Catalog / Schema at the relevant level to this storage.

9) Schema level - I try to design my metastore(s) so that I am not putting schemas in different storage accounts. Still, I assign a separate default location to each schema, on the /<container>/<schema_name>/ storage path.
But this is because I separate e.g. divisions on catalog level. If you decide to separate divisions on schema level, it would be OK to separate storage on schema level.
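
As a rough illustration of points 1 and 9, here is a minimal Python sketch that builds the SQL for giving each catalog and each schema its own default (managed) location. All names are hypothetical: the storage account `stsales`, the container `dev`, the catalog `sales_dev`, the schema `orders`, and the `_catalog_default` folder used for the catalog-level location. It also assumes the paths are already covered by a Unity Catalog external location.

```python
def abfss_url(container: str, account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URL: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    clean = path.strip("/")
    return f"{base}/{clean}" if clean else base


def create_catalog_sql(catalog: str, container: str, account: str) -> str:
    """Catalog-level default location (the folder name is an assumption)."""
    location = abfss_url(container, account, "_catalog_default")
    return f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{location}'"


def create_schema_sql(catalog: str, schema: str, container: str, account: str) -> str:
    """Schema-level default location on /<container>/<schema_name>/."""
    location = abfss_url(container, account, schema)
    return (
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema} "
        f"MANAGED LOCATION '{location}'"
    )


if __name__ == "__main__":
    # In a Databricks notebook each statement could be passed to spark.sql(...).
    print(create_catalog_sql("sales_dev", "dev", "stsales"))
    print(create_schema_sql("sales_dev", "orders", "dev", "stsales"))
```

Keeping the path logic in one helper makes it easy to switch between container-level and storage-account-level separation later: only the `container` and `account` arguments change.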

raphaelblg
Contributor III

Hello @pernilak ,

Thanks for reaching out to Databricks Community! My name is Raphael, and I'll be helping out.

Should all catalogs and the metastore reside in the same storage account (but different containers)?

 


Yes, Databricks recommends having one separate storage location (container) per catalog. But you can also have one single container for the whole metastore (metastore-level storage). If you need to isolate your data at infrastructure level (i.e. in separate storage accounts), then the best practice is to use External Locations, but you can't create a whole catalog in an external location, only smaller entities such as tables or volumes.

For information to help you decide whether you need metastore-level storage, see "(Optional) Create metastore-level storage" and "Data is physically separated in storage".

 

Should the metastore have one storage account and other catalogs reside in a different one (separate containers)?

Metastore and catalogs should reside in the same storage account. You can have one container per catalog or one container for all metastore entities; it's up to you to decide. My answer to your question no. 1 has the auxiliary doc URLs that should help you understand which option is better for you.

Should dev, test and prod catalogs be in different storage accounts?

I don't think that's possible. If you want to work with separate storage accounts, then you should use External Locations.

Should one domain (we have catalogs based on domain) be in one storage account, but then have dev, test and prod catalogs in different containers?

This is a good pattern.

Should data be separated based on the requirements for retention and backup?

Not necessarily, but it can be done. With UC, data retention and backup will mostly rely on your cloud storage retention policies/backup policies. Databricks itself allows for table-level short-term backups (https://docs.databricks.com/en/delta/history.html) while also always respecting the cloud storage policies.

Or should we separate data on schemas (different containers or storage accounts?)?

This is a good pattern, but you must use a single storage account for storing these schemas. The storage location for managed data in a schema is resolved in this order:

  1. If a location has been provided for mySchema, it will be stored there.
  2. If not, and a location has been provided on myCatalog, it will be stored there.
  3. Finally, if no location has been provided on myCatalog, it will be stored in the location associated with the my-region-metastore.
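
The three-step lookup above can be sketched in a few lines of Python. This is only an illustration of the precedence order, not how Unity Catalog implements it; the function name and the decision to raise an error when nothing is configured are assumptions.

```python
from typing import Optional, Tuple


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> Tuple[str, str]:
    """Return (level, location) for the first level that has a location set.

    Precedence follows the list above: schema, then catalog, then metastore.
    """
    candidates = (
        ("schema", schema_location),
        ("catalog", catalog_location),
        ("metastore", metastore_location),
    )
    for level, location in candidates:
        if location:
            return level, location
    # Assumption for this sketch: treat "nothing configured" as an error.
    raise ValueError("no managed location configured at any level")


if __name__ == "__main__":
    # No schema location, so the catalog location wins over the metastore one.
    print(resolve_managed_location(None, "abfss://cat@acct.dfs.core.windows.net",
                                   "abfss://meta@acct.dfs.core.windows.net"))
```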

Should some schemas not reside in the same storage account as the catalog?

I don't think that's possible, but you can have external tables in other storage accounts registered under one of your catalog's schemas. You can also have external volumes (also in separate storage accounts) for storing/fetching files in your Unity Catalog.

Final observations:

Let's say you have a storage account no.1 and no.2. Then you choose no.1 to create your metastore and you create your dev catalog there.

But, you do have some tables in storage account no.2 that you'd like to use in your UC dev catalog stored in storage account no.1.

If this is the case, then you'll be creating an external table on your dev catalog pointing to storage account no.2. But what happens with your data and metadata?

Data -> Stored under storage account no.2

Metadata -> Stored under storage account no.1 (dev catalog in this example)
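
To make the scenario concrete, here is a minimal Python sketch that builds the statement for such an external table. Everything named here is hypothetical (catalog `dev`, schema `sales`, table `orders`, storage account `storageaccount2`, container `raw`), and it assumes the data is in Delta format and that an external location covering the path already exists.

```python
def external_table_sql(catalog: str, schema: str, table: str, location: str) -> str:
    """Register existing Delta files as an external table in the given catalog.

    The table is registered under catalog.schema; the data files stay at `location`.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} "
        f"USING DELTA LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Data files live in storage account no. 2 (hypothetical path).
    data_path = "abfss://raw@storageaccount2.dfs.core.windows.net/orders"
    # In a Databricks notebook this string could be passed to spark.sql(...).
    print(external_table_sql("dev", "sales", "orders", data_path))
```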

Feel free to ask any further questions, if my response addresses your concerns then please mark it as the official solution 🙂 

Thanks!

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks