SepidehEb



Unity Catalog (UC) is Databricks' unified governance solution for all data and AI assets on the Data Intelligence Platform. UC is central to implementing MLOps on Databricks, as it is where all your assets reside and are governed. Therefore, using UC is a prerequisite for all of the practices we recommend in the MLOps Gym series. In this article, we give you recommendations on how to organize UC so that it is scalable and future-proof. We will address questions such as:

  • How should you organize use cases/projects?
  • How should you organize raw data, features, models, and assets of each use case?
  • What privileges should each role in the team have?
  • Do you need a playground area for development?
  • Should you leverage managed or external tables?
  • How should you access your ML assets from outside Databricks, e.g., AzureML or Amazon SageMaker?

Unity Catalog Overview 

Centralizing governance on Databricks with Unity Catalog brings a multitude of benefits.

Your users and groups are no longer bound to a single workspace; they're managed centrally at the account level (check out the identity federation documentation). This means that managing access permissions across different workspaces is simpler too. In this setup, you can create your assets in any workspace and use them in others, provided the right permissions have been granted.

Before moving on to the next section, it is crucial that you understand the UC terminology, including metastores, catalogs, tables, etc. Unity Catalog's documentation and the Big Book of MLOps v2 provide detailed definitions of all the terms, in a generic sense and in the context of MLOps respectively.

Managed vs External

Tables and volumes can be managed or external. Data meant to be processed and accessed only within Databricks is best stored in managed tables/volumes. Managed tables are low maintenance and enjoy enhanced performance.

When working with managed tables/volumes, one caveat is that their content is not accessible to outside tools without going through a Databricks cluster or SQL warehouse. In the context of MLOps, this point becomes particularly pertinent as an ML pipeline encompasses data from different sources and might serve different downstream applications. 

Depending on your architecture, you might need direct access to the data by external tools such as AWS SageMaker or AzureML. In such cases, you should store your assets in external tables/volumes on UC.
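
As a minimal sketch of this, the DDL for an external table pins the data to a cloud storage location you control; all names and the S3 path below are illustrative, and in a notebook you would run the resulting string with `spark.sql(ddl)`:

```python
def external_table_ddl(catalog: str, schema: str, table: str, location: str) -> str:
    """Build DDL for a UC external table at an explicit storage location.

    Illustrative sketch: the catalog/schema/table names and the cloud
    path are assumptions, not a fixed convention.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} "
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "dev", "fraud_detection", "training_data",
    "s3://my-bucket/fraud/training_data",
)
print(ddl)
```

Because the files live at the `LOCATION` path rather than in UC-managed storage, external tools can read them directly from cloud storage.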

How should you organize your business assets in UC?

Now that you are familiar with the different objects in UC, let’s examine what decisions you have to make to organize your data and AI assets. When working with a three-level namespace, you have to think about how to organize your content in the following three dimensions.
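
The three-level namespace means every asset is addressed as catalog.schema.asset. A small helper, with illustrative names, makes the shape concrete:

```python
def uc_name(catalog: str, schema: str, asset: str) -> str:
    """Compose a fully qualified three-level Unity Catalog name.

    Sketch only: validates that no part is empty or contains a dot,
    then joins the parts as catalog.schema.asset.
    """
    for part in (catalog, schema, asset):
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return f"{catalog}.{schema}.{asset}"

full_name = uc_name("dev", "gold", "customer_features")
print(full_name)  # dev.gold.customer_features
```

The decisions in the rest of this section are essentially about what each of those three levels should represent for your organization.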

Team

The size and complexity of your organization have an important role in designing your UC layout.

At one end of the spectrum - e.g. in large organizations, or where different teams operate completely independently from one another out of choice or due to regulations - customers could choose a high level of segregation between different teams' data and AI assets, and therefore have a separate catalog/schema per team. In this case, each team - including the data science team - has permission to work with their dedicated catalog/schema and is completely cut off from other teams' assets.

At the other end of this spectrum - e.g. in smaller organizations - customers might forgo separation per team and allow all teams to work across shared catalogs/schemas with the right access control in place. 

Business Context

The business context of your projects is another dimension that you should take into consideration when designing the layout of your UC. In some businesses, data and AI assets associated with one project or business unit must not leak into any others, leading customers to have a separate schema per use case. Other projects might be transient in nature, or might share resources with other projects and not require their own separate schema. An example is a one-off analysis project that resides in a shared "experiments" schema for all such analyses.

Environment Scope

Depending on the number of environments you use in your development cycle, you can choose to use catalogs to separate different "qualities" of data and AI assets. In the context of MLOps, the data and models you have in your lower environments, such as your development environment, are by design lower in quality and more experimental. As you progress in your MLOps lifecycle, you test them and gradually promote them to higher environments, the highest of which is your production environment.

As an example, if you have 4 environments, sandbox, dev, staging, and prod, then you probably want to have a separate catalog per environment to have the highest level of isolation between high-quality and low-quality assets. 
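
A naming helper like the following sketches this mapping; the environment names are the four from the example above, and the optional business-unit prefix anticipates layouts like Scenario 2 below (all names are illustrative):

```python
from typing import Optional

# Illustrative environment list, lowest to highest quality.
ENVIRONMENTS = ("sandbox", "dev", "staging", "prod")

def catalog_for(env: str, business_unit: Optional[str] = None) -> str:
    """Map an environment (and, optionally, a business unit) to a catalog name."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{business_unit}-{env}" if business_unit else env

print(catalog_for("dev"))          # dev
print(catalog_for("prod", "BU"))   # BU-prod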

Example UC Setup Patterns for MLOps

Now let’s have a look at a few different ways that you can organize your data and AI assets in UC.

Scenario 1 

Description

Organizations where different teams and business units have the flexibility to share data and models with each other. For instance, multiple business units with their own data science and BI teams can collaborate and exchange common data and features, fostering innovation across the organization.

This scenario is thoroughly described in the Big Book of MLOps v2.

Catalog:

  • One per environment: Sandbox, Dev, Staging, Prod.
  • The medallion schemas (Bronze, Silver, Gold) are replicas of one another. For simplicity, only the content of the Dev catalog is displayed below.

Schema:

  • In the Sandbox catalog: one schema per team/project
  • In the Dev/Staging/Prod catalogs: one schema per Bronze/Silver/Gold, and then more schemas per use case/project

Assets:

  • Data tables in the Bronze/Silver/Gold schemas
  • Generic feature tables, reusable across different use cases, in the Gold schema
  • Use-case-specific features, as well as other use-case-specific assets such as models, functions, and volumes, in the use case schema
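
A small lookup sketches this Dev-catalog layout; every schema and asset name here is illustrative, not part of the pattern itself:

```python
# Illustrative Scenario 1 Dev catalog: medallion schemas shared across
# use cases, plus one schema per use case for its own assets.
dev_catalog = {
    "bronze": ["raw_events"],
    "silver": ["cleaned_events"],
    "gold": ["customer_features"],           # generic, reusable features
    "churn_prediction": ["churn_features",   # use-case-specific features,
                         "churn_model"],     # models, functions, volumes
}

def locate(asset: str, catalog: dict) -> str:
    """Return the schema that holds an asset, or raise KeyError if absent."""
    for schema, assets in catalog.items():
        if asset in assets:
            return schema
    raise KeyError(asset)

print(locate("churn_model", dev_catalog))  # churn_prediction
```

The same dictionary shape would be replicated per environment catalog (Sandbox, Dev, Staging, Prod).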

 

[Diagram: Scenario 1 catalog and schema layout]

 

Scenario 2

Description

Organizations where business units have to segregate their data and AI assets due to regulation or other reasons. For instance, if you are working in an organization with business units that are completely independent in their data and functionality, such as two brands under an umbrella company, this scenario applies to you. Another instance is when a particular sector of the business operates under strict regulations and handles confidential data, necessitating isolation and heightened security measures to safeguard it from the rest of the company's data.

Catalog:      

  • One per business unit and environment: BU-Sandbox, BU-Dev, BU-Staging, BU-Prod.
  • The medallion schemas (Bronze, Silver, Gold) are replicas of one another. For simplicity, only the content of the Dev catalog is displayed below.

Schema:

  • In the Sandbox catalog: one schema per team/project
  • In the Dev/Staging/Prod catalogs: one schema per Bronze/Silver/Gold, and then more schemas per use case/project

Assets:

  • Data tables in the Bronze/Silver/Gold schemas
  • Generic feature tables, reusable across different use cases, in the Gold schema
  • Use-case-specific features, as well as other use-case-specific assets such as models, functions, and volumes, in the use case schema

 

[Diagram: Scenario 2 catalog and schema layout]


Who should have access to what and to what degree?

Following the principle of least privilege, as generally recommended by Databricks, users should only have access to the assets they absolutely require, and with the minimum privilege needed. Many of our customers choose to completely close off their Prod environment to modification by individual users to avoid human error. Instead, they only allow service principals to operate their jobs in production.

But wouldn’t this cause a problem with data discoverability? Not necessarily. You can grant the BROWSE privilege to all users on assets you wish to be discoverable. The BROWSE privilege allows users to see the metadata, and therefore discover assets, without being able to access the assets themselves.

Another best practice is to grant privileges to groups rather than individual users. For example, if you have a team of BI analysts, create a group, add all members to it, and then assign privileges to that group instead of to the users one by one. In this case, if somebody joins or leaves the team, all you need to do is add them to or remove them from that group, instead of granting or revoking access to each asset individually.
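
These two practices can be sketched as GRANT statements generated per group; the group names and securables below are illustrative, and in practice you would run each statement with `spark.sql()`:

```python
def grant_statements(group, privileges, securable):
    """Build GRANT statements assigning privileges to a group on a UC securable.

    Sketch with illustrative names; each returned string is standard
    UC GRANT syntax, ready to execute from a notebook or SQL warehouse.
    """
    return [f"GRANT {p} ON {securable} TO `{group}`" for p in privileges]

# Least privilege: give the analysts' group only what it needs on Gold.
analyst_grants = grant_statements(
    "bi-analysts", ["USE SCHEMA", "SELECT"], "SCHEMA prod.gold"
)
# Discoverability without access: BROWSE exposes metadata only.
browse_grants = grant_statements("all-users", ["BROWSE"], "CATALOG prod")

for stmt in analyst_grants + browse_grants:
    print(stmt)
```

Granting to groups means onboarding a new analyst is a single group-membership change rather than a pile of per-asset grants.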

Summary

In conclusion, setting up Unity Catalog for MLOps offers a flexible, powerful way to manage data and ML assets across diverse organizational structures and technical environments. Unity Catalog's design supports a variety of architectures, enabling direct data access for external tools like AWS SageMaker or AzureML through the strategic use of external tables and volumes. It facilitates tailored organization of business assets that align with team structures, business contexts, and the scope of environments, offering scalable solutions for both large, highly segregated organizations and smaller entities with minimal isolation needs. Moreover, by adhering to the principle of least privilege and leveraging the BROWSE privilege, Unity Catalog ensures that access is precisely calibrated to user needs, enhancing security without sacrificing discoverability. This setup not only streamlines MLOps workflows but also fortifies them against unauthorized access, making Unity Catalog an indispensable tool in modern data and machine learning operations.

Check out the UC best practices on Azure and AWS to find out more about how to optimize your UC setup.

Coming up next!

Next blog in this series: MLOps Gym - Beginners Guide to Cluster Configuration for MLOps
