Databricks Community

You should know these 4 things about setting up Unity Catalog!

MurtN · 07-24-2024

Over the past year we’ve seen huge adoption of Unity Catalog, and we've noticed certain considerations that are key to all successful deployments. This blog focuses on how to set up a key artifact in Unity Catalog, the catalog, and the things you should be thinking about when creating one!

What is a catalog?

Catalogs are the first layer in the data hierarchy, used to organize your data assets. A catalog contains schemas (databases), and a schema contains tables, views, volumes, models, and functions.

You find catalogs in the Catalog Explorer UI!

Logically, catalogs contain schemas, which in turn contain your data assets such as tables and models.

Catalogs should map to high-level data domains in your organization and should also be used to separate data used in development, test, and production environments.

Take for example a company that sells solar-powered products. Catalogs could be used to separate telemetry data from each of the company’s products and be specific to dev/test/prod, e.g. solar_lights_prod, solar_lights_dev, solar_panels_prod, etc.
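The naming convention above (one catalog per data domain and environment) can be sketched in a few lines of Python. This is only an illustration: the helper name `catalog_names` is hypothetical, and the domain and environment lists are taken from the example company.

```python
from itertools import product


def catalog_names(domains, environments):
    """Return one catalog name per (domain, environment) pair."""
    return [f"{domain}_{env}" for domain, env in product(domains, environments)]


# Two product lines, three environments: six catalogs in total.
names = catalog_names(["solar_lights", "solar_panels"], ["dev", "test", "prod"])
print(names)
```

Generating the names from two short lists keeps the convention consistent as new product lines are added.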

Now that we know what a catalog is, here are 4 things you should consider when setting them up!

4 things to consider when creating a catalog

You can follow along with the below steps directly within the ‘Create Catalog’ UI, found within the Catalog Explorer.

Picking the right catalog type

The first consideration when creating a catalog is catalog ‘type’. There are 3 possible types of catalogs that can be created in UC: Standard, Foreign, and Shared.

Standard catalogs are your ‘bread and butter’ catalog, used to hold tables, views, models, and other data and AI assets. When creating a standard catalog, you specify a managed storage location for it; this is required if your metastore has no metastore-level storage location. The storage location points to a path in cloud object storage (S3/ADLS/GCS) where data for the managed tables that you add to the catalog is stored.

You can now specify “Default storage” (currently in private preview) as the managed storage location for a standard catalog. With default storage, you get a new catalog instantly, with out-of-the-box security, unified data access, and cost observability, without the overhead of managing cloud infrastructure. The private preview requires sign-up.

Foreign catalogs mirror a database in an external data system such as Postgres, Redshift, or Snowflake. They enable users to run read-only queries against that data system right from a Databricks workspace! You can apply all of Unity Catalog’s fine-grained security on top of this data, and track lineage of how the data is used.

Shared catalogs allow you to access data shared with you by other organizations. This includes any external data shared with your organization through the Delta Sharing protocol, including data from the Databricks Marketplace. You can also use shared catalogs to access data across regions in your own account.

Take the example once again of the company that sells solar-powered equipment:

Admins could set up standard catalogs to hold telemetry data ingested into Databricks from sensors attached to their products.
Foreign catalogs could be configured to make sales data for these products, which lives in an external system such as Postgres, available to users in Databricks.
Shared catalogs could be used to make weather data from the Databricks Marketplace, such as general measurements provided by meteorological agencies or weather services, available to users.

All these datasets are now made accessible to users in one place and can simply be joined and queried through Unity Catalog! Furthermore, admins can use a single governance model to control who has access to what data.
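The same three catalog types can also be created with SQL rather than through the UI. The sketch below only builds the statement strings so the shape of each command is visible; the helper names, the bucket path, the connection name `postgres_sales`, and the provider and share names are all hypothetical placeholders, and the foreign catalog assumes a connection to the external system has already been created.

```python
import re

# Conservative check so that only plain identifiers are spliced into SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    if not _IDENT.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def _literal(value):
    # Sketch-level safety: refuse quotes instead of trying to escape them.
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported literal: {value!r}")
    return f"'{value}'"


def create_standard_catalog(name, managed_location=None):
    """Standard catalog, optionally with its own managed storage location."""
    sql = f"CREATE CATALOG {_ident(name)}"
    if managed_location is not None:
        sql += f" MANAGED LOCATION {_literal(managed_location)}"
    return sql


def create_foreign_catalog(name, connection, database):
    """Foreign catalog mirroring one database behind an existing connection."""
    return (
        f"CREATE FOREIGN CATALOG {_ident(name)} "
        f"USING CONNECTION {_ident(connection)} "
        f"OPTIONS (database {_literal(database)})"
    )


def create_shared_catalog(name, provider, share):
    """Shared catalog created from a Delta Sharing share."""
    return f"CREATE CATALOG {_ident(name)} USING SHARE {_ident(provider)}.{_ident(share)}"


statements = [
    create_standard_catalog("solar_lights_prod", "s3://example-bucket/solar_lights_prod"),
    create_foreign_catalog("sales_postgres", "postgres_sales", "sales"),
    create_shared_catalog("weather_shared", "weather_provider", "daily_measurements"),
]
for statement in statements:
    print(statement)
```

Each string could then be run from a notebook or SQL editor by a user with the privileges needed to create catalogs.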

Configuring workspace-catalog bindings

The second consideration has to do with the Workspaces that you want the catalog to be accessed from.

As a refresher, a Workspace is the environment users enter to run their workloads. It is common to have separate Workspaces for development and production use cases, as well as separate Workspaces per business unit.

Going back to the example of the company that sells solar-powered products, say you configured different catalogs to hold telemetry data from various products and divided it into data that should be used in development, testing, and production Workspaces. To ensure that production data held in the solar_panels_prod catalog can only ever be accessed from production Workspaces, data owners can use a feature called Workspace-catalog binding. This feature allows catalog owners to specify an allow-list of Workspaces from which their catalog can be accessed.

Note that data access is still governed by user-level permissions. If a user is given access to a table solar_lights_prod.weather.actuals_daily, then catalog bindings simply dictate which Workspaces the user can exercise that access from.

Workspace-catalog bindings allow you to specify an allow-list of Workspaces that are trusted to access a catalog.
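The interaction between bindings and permissions can be modeled as a simple conjunction: a user needs the privilege, and the request must come from a bound Workspace. The following is a toy model of that rule, not Databricks code; the function name and the workspace IDs are made up for illustration.

```python
from typing import FrozenSet, Optional


def can_access(user_has_privilege: bool,
               workspace_id: int,
               bound_workspaces: Optional[FrozenSet[int]]) -> bool:
    """Toy model: access needs BOTH a privilege and an allowed workspace.

    bound_workspaces=None models a catalog without bindings, which is
    reachable from every workspace attached to the metastore.
    """
    workspace_ok = bound_workspaces is None or workspace_id in bound_workspaces
    return user_has_privilege and workspace_ok


PROD_WORKSPACE = 1001  # hypothetical workspace IDs
DEV_WORKSPACE = 2002

# solar_panels_prod is bound to the production workspace only.
prod_bindings = frozenset({PROD_WORKSPACE})

print(can_access(True, PROD_WORKSPACE, prod_bindings))   # privilege + bound workspace
print(can_access(True, DEV_WORKSPACE, prod_bindings))    # privilege, wrong workspace
print(can_access(False, PROD_WORKSPACE, prod_bindings))  # right workspace, no privilege
```

Only the first call returns True: bindings narrow where a grant can be used, and they never grant access on their own.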

Permissioning the catalog

The third consideration has to do with the users, groups, and service principals that have permissions to operate on the catalog.

As an admin, you will want to determine:

Who can create assets? Usually data engineers and the service principals that run ETL pipelines to bring in and curate data in Databricks. Also consider a ‘scratch’ space where other users can write intermediate tables and results.
Who can read data? Data scientists and analysts who are querying this data for various insights and development purposes.
Who can view metadata? General users who can search for/discover data to know it exists, and view lineage.

We strongly encourage using groups instead of individual users when assigning access to data secured in catalogs. This simplifies access provisioning: instead of granting access to each user one by one, you add the user to an established group that already has privileges on a data product.

Groups should always be synchronized from the system they are managed in via the SCIM API (a standard REST API for user and group synchronization). This can be accomplished using an Identity Provider (IdP) such as Okta or Microsoft Entra ID (formerly Azure Active Directory).

Let’s once again look at the example of the company that sells solar-powered products. A catalog has been configured called solar_lights_prod to hold all the telemetry data collected from this particular line of products. An admin starts out by creating 3 schemas in this catalog called bronze, silver, and gold based on the lakehouse medallion architecture. Additionally, a “sandbox schema” is created to house ad-hoc asset creation. Groups can be leveraged here to configure who can do what. See the example below of reader/writer groups being used to provision access:

Permissioning setup example using groups and medallion architecture in Unity Catalog
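One way a group-based setup like this could be expressed is as a set of GRANT statements generated from a small layout table. The sketch below is an assumption-laden illustration: the group names `solar_lights_writers` and `solar_lights_readers`, and the choice of which group gets which privileges on which schema, are hypothetical and should be adapted to your own policy.

```python
# Privilege bundles used in this sketch.
READ = ("USE SCHEMA", "SELECT")
WRITE = ("USE SCHEMA", "SELECT", "MODIFY", "CREATE TABLE")

# Hypothetical layout: group -> schema -> privileges.
LAYOUT = {
    "solar_lights_writers": {"bronze": WRITE, "silver": WRITE, "gold": WRITE, "sandbox": WRITE},
    "solar_lights_readers": {"gold": READ, "sandbox": WRITE},
}


def grant_statements(catalog, layout):
    """Build GRANT statements for every group in the layout."""
    statements = []
    for group, schemas in layout.items():
        # Every group needs USE CATALOG before schema-level grants are usable.
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`")
        for schema, privileges in schemas.items():
            statements.append(
                f"GRANT {', '.join(privileges)} ON SCHEMA {catalog}.{schema} TO `{group}`"
            )
    return statements


grants = grant_statements("solar_lights_prod", LAYOUT)
for grant in grants:
    print(grant)
```

Because privileges granted on a schema are inherited by the objects inside it, a handful of schema-level grants per group is usually enough to cover the whole catalog.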

As a best practice, we encourage admins to grant the BROWSE privilege to all users! Granting BROWSE on a catalog allows users to see metadata for its child objects in Catalog Explorer without being able to read (SELECT) the actual data. Additionally, users can see the data lineage for these objects.

BROWSE allows users to see metadata for objects in a catalog encouraging data discovery

BROWSE makes data lineage accessible so users can better understand data flows
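As a rough sketch, the BROWSE grant itself is a single statement. The helper name below is hypothetical, and the default principal assumes you want to target the built-in `account users` group, which contains every user in the account.

```python
def grant_browse(catalog, principal="account users"):
    """Build a statement that lets a principal discover, but not read, a catalog."""
    return f"GRANT BROWSE ON CATALOG {catalog} TO `{principal}`"


browse_sql = grant_browse("solar_lights_prod")
print(browse_sql)
```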

Enriching the catalog’s metadata

The fourth and final consideration has to do with adding metadata for the catalog. Metadata includes tags and comments. Adding metadata is crucial for organizing datasets in Unity Catalog and makes it easier for workspace users to find the data they are looking for. Furthermore, metadata gives in-product AI assistants more context for better-informed answers.

Comments support basic markdown syntax and can be used to add descriptions and links to enable easier data documentation and discovery for users.

Markdown comments can be leveraged to give users info about data stewards for questions about datasets and links for how to gain access.

Tags are attributes containing keys and optional values that you can apply to securable objects in Unity Catalog. Tagging is useful for organizing and categorizing different securable objects within a metastore. Using tags also simplifies search and discovery of your data assets.

Add tags to classify data and indicate things such as data source and sensitivity level.

Users can search by tags to quickly find relevant datasets.
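Comments and tags can also be set with SQL. The sketch below builds a `COMMENT ON CATALOG` and an `ALTER CATALOG ... SET TAGS` statement; the helper names, the tag keys and values, and the access-request URL are hypothetical, and the backslash escaping in `sql_string` assumes the default string-literal escaping rules of Databricks SQL.

```python
def sql_string(value):
    """Quote a Python string as a SQL literal (assumes backslash escaping)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def comment_on_catalog(catalog, markdown):
    """Attach a (markdown) description to a catalog."""
    return f"COMMENT ON CATALOG {catalog} IS {sql_string(markdown)}"


def set_catalog_tags(catalog, tags):
    """Apply key/value tags to a catalog."""
    pairs = ", ".join(f"{sql_string(k)} = {sql_string(v)}" for k, v in tags.items())
    return f"ALTER CATALOG {catalog} SET TAGS ({pairs})"


comment_sql = comment_on_catalog(
    "solar_lights_prod",
    "**Steward:** solar telemetry team. [Request access](https://example.com/access)",
)
tags_sql = set_catalog_tags(
    "solar_lights_prod",
    {"source": "telemetry", "sensitivity": "internal"},
)
print(comment_sql)
print(tags_sql)
```

Keeping tag keys to a small, agreed vocabulary (for example a source and a sensitivity level) makes tag-based search more useful than free-form labels.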

Get started today!

In this blog, we’ve discussed the 4 main considerations when setting up a catalog object and the best practices that go along with it. Try it out yourself by creating a catalog today!
