Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Unity catalog implementation

prad18
New Contributor III

Hello Databricks Community,

We are in the process of planning a Unity Catalog implementation for our organization, and I'd like to seek input on some architectural decisions. We're considering various approaches to workspace separation, storage account allocation, and Azure subscription management. I'd greatly appreciate insights from those who have experience with similar implementations.

I have already gone through most of the Data + AI Summit videos, Databricks blogs, and best-practice guides, but I haven't found a detailed walkthrough of a Unity Catalog implementation.

Here are the key questions we're grappling with:

1. Workspace Separation Strategy : 

Development and Staging environments -> Single Workspace
Production environment -> Separate Workspace
What are the advantages and disadvantages of this approach?

2. Storage Account Allocation :

Should we use separate Azure storage accounts for:

  1. Unity Catalog metastore
  2. Actual data (landing zone)

What are the implications of this approach, and what factors should we consider in making this decision?

3. Azure Subscription and Metastore Architecture :

Should we maintain separate Azure subscriptions for Development and Production environments? (I think we should.)
If we opt for separate subscriptions:
a) Where should we create the Unity Catalog metastore(s)?
b) Should we use a single metastore in the Production subscription, or create separate metastores for each environment (one in Production, one in Development)?

What are the pros and cons of each approach, considering factors such as data governance, security, and operational efficiency?

If you've tackled similar decisions in your Unity Catalog implementation, what approach did you take? What worked well, and what challenges did you face?

Thank you in advance for your insights and advice!

Accepted Solutions

filipniziol
Contributor III

Hi @prad18,

I have experience working with Databricks on Azure, and here are some considerations regarding your Unity Catalog implementation:

Important Consideration:
The most important fact on Azure is that you can only have a single metastore per tenant per region. This means that when planning your Unity Catalog implementation, you need to design your architecture to work within this limitation. Let me address your questions, starting from the last one and working upwards.

3. Azure Subscription and Metastore Architecture:

Given the limitation of a single metastore per tenant per region, the architectural decisions around Unity Catalog need to reflect this constraint.

Single Metastore Implementation (Answer to 3.b):
Since you can only have one metastore per tenant per region, you will set up a single metastore. This single metastore will contain multiple catalogs. You will create separate catalogs within the metastore for each environment (e.g., `dev_catalog`, `staging_catalog`, `prod_catalog`). 
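
As a rough illustration of the one-catalog-per-environment layout, the statements below create the three catalogs named above. This is only a sketch: it assumes the metastore has a default managed storage location (otherwise each statement needs a `MANAGED LOCATION` clause), and the comments on each catalog are illustrative.

```sql
-- One catalog per environment inside the single regional metastore.
-- Assumption: the metastore has a default managed storage location.
CREATE CATALOG IF NOT EXISTS dev_catalog COMMENT 'Development data';
CREATE CATALOG IF NOT EXISTS staging_catalog COMMENT 'Staging data';
CREATE CATALOG IF NOT EXISTS prod_catalog COMMENT 'Production data';

-- List the catalogs visible to the current user.
SHOW CATALOGS;
```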

2. Storage Account Allocation:
Separate Azure Storage Accounts:

  • Unity Catalog Metastore vs. Actual Data:
    • Metastore Storage Account: Use a dedicated storage account for Unity Catalog metadata. This storage is used to manage the metastore's metadata and does not directly interact with your data processing tasks.
    • Data Storage Account: Use separate storage accounts for your actual data. You may create a separate storage per catalog and then separate blob container for landing, bronze, silver and gold.
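
One possible way to express the "storage per catalog, container per layer" idea in SQL is sketched below. The storage account name `proddata` and the container names are hypothetical, and the sketch assumes external locations covering these paths already exist and that you hold the privileges to create managed storage on them.

```sql
-- Assumption: 'proddata' is a hypothetical storage account dedicated to the
-- production catalog, with one container per layer.
CREATE CATALOG IF NOT EXISTS prod_catalog
MANAGED LOCATION 'abfss://catalog@proddata.dfs.core.windows.net/';

-- Managed tables in each schema are stored in that layer's container.
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze
MANAGED LOCATION 'abfss://bronze@proddata.dfs.core.windows.net/';

CREATE SCHEMA IF NOT EXISTS prod_catalog.silver
MANAGED LOCATION 'abfss://silver@proddata.dfs.core.windows.net/';

CREATE SCHEMA IF NOT EXISTS prod_catalog.gold
MANAGED LOCATION 'abfss://gold@proddata.dfs.core.windows.net/';

-- Raw landing files can be exposed as an external volume (hypothetical name).
CREATE EXTERNAL VOLUME IF NOT EXISTS prod_catalog.bronze.landing
LOCATION 'abfss://landing@proddata.dfs.core.windows.net/';
```

Schema-level managed locations are optional. If you omit them, managed tables fall back to the catalog's managed location.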

1. Workspace Separation Strategy:

Keep your workspaces separate: one workspace each for production, staging, and development. Staging may mirror production, so it can contain production data. You do not want developers to accidentally run their code against production data, and mixing development and staging in one workspace adds that risk. In short, the best practice is a separate workspace per environment.
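
Workspace separation is usually paired with catalog-level privileges, so that a development principal cannot write to production even if it reaches the catalog. A minimal sketch, assuming an account-level group called `developers` (a hypothetical name) and the environment catalogs from above:

```sql
-- Developers get full working privileges only on the dev catalog.
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY, CREATE SCHEMA, CREATE TABLE
ON CATALOG dev_catalog TO `developers`;

-- Read-only access to staging, for validating releases.
GRANT USE CATALOG, USE SCHEMA, SELECT
ON CATALOG staging_catalog TO `developers`;

-- No grants on prod_catalog: verify that nothing is listed for the group.
SHOW GRANTS `developers` ON CATALOG prod_catalog;
```

Catalogs can also be bound to specific workspaces, but that is configured outside SQL, so it is not shown here.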

Summary:

  • Single metastore with multiple catalogs. At least 1 catalog per environment.
  • Separate storage accounts for metadata and data
  • At least 1 workspace per environment

Last but not least: your organization may have multiple projects sharing the same Unity Catalog metastore.
For example, the finance department has its data and the sales department has its data.
In that case, create catalogs per environment per project. In this example: `finance_dev_catalog`, `finance_staging_catalog`, `finance_prod_catalog`, `sales_dev_catalog`, `sales_staging_catalog`, `sales_prod_catalog`.

Proper naming will make your life easier in the future.
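
To show why consistent names help, here is a small sketch in which the same query runs against different environments by switching only the catalog. The `reporting` schema and `invoices` table are hypothetical names used for illustration, and the table is assumed to already exist in each catalog.

```sql
-- Select the environment once; unqualified names then resolve inside it.
USE CATALOG finance_dev_catalog;

-- Resolves to finance_dev_catalog.reporting.invoices (hypothetical table).
SELECT COUNT(*) AS invoice_count FROM reporting.invoices;

-- The same table in production, addressed with the full three-level name.
SELECT COUNT(*) AS invoice_count FROM finance_prod_catalog.reporting.invoices;
```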


Brahmareddy
Honored Contributor

Hi @prad18, How are you doing today?

For workspace separation, using a single workspace for Development and Staging, and a separate one for Production, balances isolation and cost-efficiency, but be aware it could complicate promotion processes. For storage account allocation, separating storage accounts for Unity Catalog and data ensures better security and governance, but could add complexity in management. Regarding Azure subscriptions, having separate subscriptions for Development and Production is a good practice for security and cost control. Create the Unity Catalog metastore in the Production subscription, but whether to use a single or separate metastores depends on your governance needs; separate metastores can enhance isolation but might add management overhead.

Let me know if it works well for you.

Regards,

Brahma

prad18
New Contributor III

Hi @Brahmareddy,

Thank you for your response. I'm good. How are you?
We are thinking along the same lines as you described for storage, subscriptions, and workspaces: keeping them separate. Have you implemented something similar, or do you know someone who has?

Also, when you mentioned that "it could complicate promotion processes", do you foresee specific issues, or have you already run into them?

Any details that could help us with the implementation would be much appreciated 🙂

Regards,

Prad18

Sujitha
Databricks Employee

@prad18 Requesting you to drop an email to help@databricks.com. The relevant team will reach out to you with help.

prad18
New Contributor III

Thank you @Sujitha. I sent an email to help@databricks.com but haven't received a response yet.

Sujitha
Databricks Employee

@prad18 do you have the ticket number? If not, could you write to help@databricks.com and share the ticket number associated with it?

prad18
New Contributor III

Hi @Sujitha ,

Here is the ticket number we received for the query: #00528553.

Our Databricks plan is the pay-as-you-go tier.

Regards,

Prad18

prad18
New Contributor III

Hi @filipniziol ,

Thank you for providing such a detailed overview of Unity Catalog implementation considerations on Azure Databricks. Your explanation is thorough and addresses key points.

Our approach is in line with the details you shared.

1. There is still some ambiguity about how storage should be used. For example, if we specified a storage account when creating the metastore, do we need to specify the same storage account again when creating a catalog, or is there a better approach?

Please also let us know about any issues or challenges you faced while keeping separate storage accounts for the metastore and the catalogs.

2. Did you face any cost or scalability issues while implementing or maintaining the solution?

Regards,

prad18

filipniziol
Contributor III

Hi @prad18 ,

I'm glad the previous response was helpful! Let's address your remaining questions:

  1. Cost Differences Between Single vs. Multiple Azure Storage Accounts: The cost difference between using a single Azure storage account for both Unity Catalog and your data versus multiple storage accounts is generally negligible. The primary cost driver is the amount of data stored. For example, whether you store 100 GB in a single storage account or split it across multiple storage accounts, you will still be paying for 100 GB of storage. 

  2. Specifying Storage When Creating a Catalog: When creating a catalog, you do not need to specify storage if you want to use the default metastore storage. However, if you prefer to store data in a different location from where your metastore is, you need to specify the MANAGED LOCATION. For example: 

 

CREATE CATALOG sample_catalog
MANAGED LOCATION '<azure storage location>';

 

This approach allows you to organize your data storage more effectively, keeping metadata in one storage account (for the metastore) and your actual data in another.
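
To make the placeholder above a bit more concrete, here is a hedged sketch of the full sequence. A catalog's managed location has to sit inside a path covered by an external location, so that is created first. The storage account `examplestorage`, the container `sample`, and the credential `example_credential` are hypothetical names, and the sketch assumes the storage credential already exists.

```sql
-- Assumption: 'example_credential' is an existing storage credential with
-- access to the hypothetical 'examplestorage' account.
CREATE EXTERNAL LOCATION IF NOT EXISTS sample_catalog_location
URL 'abfss://sample@examplestorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL example_credential)
COMMENT 'Root path for sample_catalog managed data';

-- The managed location must fall under the external location's URL.
CREATE CATALOG IF NOT EXISTS sample_catalog
MANAGED LOCATION 'abfss://sample@examplestorage.dfs.core.windows.net/catalogs/sample';

-- Inspect the catalog, including its storage location.
DESCRIBE CATALOG EXTENDED sample_catalog;
```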
