Over the past year we’ve seen huge adoption of Unity Catalog, and we've noticed certain considerations that are key to all successful deployments. This blog focuses on how to set up a key object in Unity Catalog, the catalog, and the things you should think about when creating one!
Catalogs are the first layer in the data hierarchy, used to organize your data assets. A catalog contains schemas (databases), and a schema contains tables, views, volumes, models, and functions.
You find catalogs in the Catalog Explorer UI!
Logically, catalogs contain schemas, which in turn contain your data assets such as tables and models.
Catalogs should map to high level data domains in your organization and also be used to separate data that is used in development, test, and production environments.
Take for example a company that sells solar powered products - catalogs could be used to separate telemetry data from each of the company’s products and be specific to dev/test/prod, e.g. solar_lights_prod, solar_lights_dev, solar_panels_prod, etc.
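The naming scheme above can be sketched in a few lines of Python that emit CREATE CATALOG statements. This is a minimal sketch, not an official pattern: the helper names and the dev/test/prod list are assumptions for illustration, and depending on how your metastore is configured you may also need to add a MANAGED LOCATION clause.

```python
import re

# Environments taken from the dev/test/prod split described above.
ENVIRONMENTS = ("dev", "test", "prod")
_IDENTIFIER = re.compile(r"^[a-z][a-z0-9_]*$")


def catalog_name(domain: str, env: str) -> str:
    """Build a catalog name such as solar_lights_prod from a domain and an environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    if not _IDENTIFIER.match(domain):
        raise ValueError(f"domain must be lowercase snake_case: {domain!r}")
    return f"{domain}_{env}"


def create_catalog_statements(domains):
    """Return one CREATE CATALOG statement per domain/environment pair."""
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog_name(domain, env)}"
        for domain in domains
        for env in ENVIRONMENTS
    ]


if __name__ == "__main__":
    for statement in create_catalog_statements(["solar_lights", "solar_panels"]):
        print(statement)
```

Each generated statement can be run from a notebook or the SQL editor by a user who is allowed to create catalogs.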
Now that we know what a catalog is, here are 4 things you should consider when setting them up!
You can follow along with the steps below directly within the ‘Create Catalog’ UI, found within the Catalog Explorer.
The first consideration when creating a catalog is catalog ‘type’. There are 3 possible types of catalogs that can be created in UC: Standard, Foreign, and Shared.
You can now specify “Default storage” (currently in preview) as the managed storage location for the Standard catalog. Using default storage, you get a new catalog instantaneously with out-of-the-box security, unified data access, and cost observability without the overhead of managing cloud infrastructure. Click here to sign up for the private preview.
Take the example once again of the company that sells solar-powered equipment:
All these datasets are now made accessible to users in one place and can simply be joined and queried through Unity Catalog! Furthermore, admins can use a single governance model to control who has access to what data.
The second consideration has to do with the Workspaces that you want the catalog to be accessed from.
As a refresher, a Workspace is the environment users enter to run their workloads. It is common to have separate Workspaces for development and production use cases, as well as separate Workspaces per business unit.
Going back to the example of the company that sells solar-powered products, say you configured different catalogs to hold telemetry data from various products and divided it into data that should be used in development, testing, and production Workspaces. To ensure that production data held in the solar_panels_prod catalog can only ever be accessed from production Workspaces, data owners can use a feature called Workspace-catalog binding. This feature allows catalog owners to specify an allow-list of Workspaces from which their catalog can be accessed.
Note that data access is still governed by user-level permissions. If a user is given access to a table solar_lights_prod.weather.actuals_daily, then Workspace-catalog bindings simply dictate which Workspaces the user can exercise that data access from.
Workspace-catalog bindings allow you to specify an allow-list of Workspaces that are trusted to access a catalog.
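The interaction between user permissions and Workspace-catalog bindings can be pictured as two independent checks that must both pass. The snippet below is a conceptual model only, not the Databricks API: the Catalog class, the can_access function, and the workspace names are hypothetical, and treating an empty allow-list as "open to all Workspaces" is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    name: str
    # Allow-list of Workspaces bound to this catalog.
    # Assumption of this sketch: an empty set means no binding restriction.
    allowed_workspaces: set = field(default_factory=set)


def can_access(user_has_privilege: bool, workspace: str, catalog: Catalog) -> bool:
    """Access requires a user-level privilege AND a Workspace on the allow-list."""
    workspace_ok = (
        not catalog.allowed_workspaces or workspace in catalog.allowed_workspaces
    )
    return user_has_privilege and workspace_ok


if __name__ == "__main__":
    prod = Catalog("solar_panels_prod", allowed_workspaces={"prod_workspace"})
    print(can_access(True, "prod_workspace", prod))   # privilege + bound Workspace
    print(can_access(True, "dev_workspace", prod))    # privilege, unbound Workspace
    print(can_access(False, "prod_workspace", prod))  # bound Workspace, no privilege
```

The point of the model is that a binding never grants access by itself; it only narrows where existing privileges can be used.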
The third consideration has to do with the users, groups, and service principals that have permissions to operate on the catalog.
As an admin, you will want to determine:
We strongly encourage using groups instead of users for assigning access to data secured in catalogs. This simplifies access provisioning: individual users don’t need to be granted access one by one and can simply be added to the established group that has privileges on a data product.
Groups should always be synchronized with the system they are managed in via the SCIM API (standard REST APIs for user and group synchronization). This can be accomplished using an Identity Provider (IdP), such as Okta or Azure Active Directory (now Microsoft Entra ID).
Let’s once again look at the example of the company that sells solar-powered products. A catalog has been configured called solar_lights_prod to hold all the telemetry data collected from this particular line of products. An admin starts out by creating 3 schemas in this catalog called bronze, silver, and gold based on the lakehouse medallion architecture. Additionally, a “sandbox schema” is created to house ad-hoc asset creation. Groups can be leveraged here to configure who can do what. See the example below of reader/writer groups being used to provision access:
Permissioning setup example using groups and medallion architecture in Unity Catalog
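A group-based setup like this can be expressed as a small mapping from schema to group privileges, which is then turned into GRANT statements. This is a sketch under stated assumptions: the group names (solar_lights_readers, solar_lights_writers) and the exact privilege sets per schema are hypothetical choices for illustration, so adjust them to your own access model.

```python
CATALOG = "solar_lights_prod"

# Hypothetical groups and privilege sets, for illustration only.
WRITER = ["USE SCHEMA", "SELECT", "MODIFY", "CREATE TABLE"]
READER = ["USE SCHEMA", "SELECT"]

SCHEMA_GRANTS = {
    "bronze": {"solar_lights_writers": WRITER},
    "silver": {"solar_lights_writers": WRITER},
    "gold": {"solar_lights_writers": WRITER, "solar_lights_readers": READER},
    # Both groups may create ad-hoc assets in the sandbox schema.
    "sandbox": {"solar_lights_writers": WRITER, "solar_lights_readers": WRITER},
}


def grant_statements(catalog, schema_grants):
    """Return GRANT statements: USE CATALOG per group, then per-schema privileges."""
    statements = []
    groups = sorted({group for grants in schema_grants.values() for group in grants})
    for group in groups:
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`")
    for schema, grants in schema_grants.items():
        for group, privileges in grants.items():
            statements.append(
                f"GRANT {', '.join(privileges)} ON SCHEMA {catalog}.{schema} TO `{group}`"
            )
    return statements


if __name__ == "__main__":
    for statement in grant_statements(CATALOG, SCHEMA_GRANTS):
        print(statement)
```

Because privileges are attached to groups, onboarding a new analyst means adding them to a group in the IdP rather than running new GRANT statements.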
As a best practice, we encourage admins to grant the BROWSE privilege to all users! Granting BROWSE on a catalog allows users to see metadata for its child objects in Catalog Explorer, without being able to read (SELECT) the actual data. Additionally, users can see the data lineage for these objects.
BROWSE allows users to see metadata for objects in a catalog encouraging data discovery
BROWSE makes data lineage accessible so users can better understand data flows
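Granting BROWSE broadly can be scripted the same way as other privileges. The sketch below builds one GRANT statement per catalog; the helper name is hypothetical, and the default principal assumes the built-in `account users` group is the right target for "all users" in your account.

```python
def browse_grants(catalogs, principal="account users"):
    """Return one GRANT BROWSE statement per catalog for the given principal."""
    return [
        f"GRANT BROWSE ON CATALOG {catalog} TO `{principal}`"
        for catalog in catalogs
    ]


if __name__ == "__main__":
    for statement in browse_grants(["solar_lights_prod", "solar_panels_prod"]):
        print(statement)
```

BROWSE exposes metadata only, so users granted it this way still need SELECT before they can read any rows.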
The fourth and final consideration has to do with adding metadata for the catalog. Metadata includes tags and comments. Adding metadata is crucial for organizing datasets in Unity Catalog and makes it easier for workspace users to find the data they are looking for. Furthermore, metadata gives in-product AI assistants more context, which leads to better-informed answers.
Comments support basic markdown syntax and can be used to add descriptions and links to enable easier data documentation and discovery for users.
Markdown comments can be leveraged to give users info about data stewards for questions about datasets and links for how to gain access.
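A markdown comment with a steward contact and an access link might be attached with a COMMENT ON CATALOG statement, built here in Python. This is an illustrative sketch: the helper names, the steward text, and the example.com URL are hypothetical, and the backslash escaping assumes the default Databricks SQL string literal rules.

```python
def sql_string(text: str) -> str:
    """Quote text as a SQL string literal, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def comment_on_catalog(catalog: str, markdown: str) -> str:
    """Build a COMMENT ON CATALOG statement carrying a markdown description."""
    return f"COMMENT ON CATALOG {catalog} IS {sql_string(markdown)}"


if __name__ == "__main__":
    # Hypothetical steward and link, for illustration only.
    description = (
        "Telemetry from the solar lights product line. "
        "**Data steward:** telemetry team. "
        "[Request access](https://example.com/access)"
    )
    print(comment_on_catalog("solar_lights_prod", description))
```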
Tags are attributes containing keys and optional values that you can apply to securable objects in Unity Catalog. Tagging is useful for organizing and categorizing different securable objects within a metastore. Using tags also simplifies search and discovery of your data assets.
Add tags to classify data and indicate things such as data source and sensitivity level.
Users can search by tags to quickly find relevant datasets.
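Tags can also be applied and looked up with SQL. The sketch below builds an ALTER CATALOG ... SET TAGS statement and a lookup query; the tag keys and values are hypothetical, and the lookup assumes a system.information_schema.catalog_tags view with catalog_name, tag_name, and tag_value columns, which you should verify against the documentation for your workspace.

```python
def _check(text: str) -> str:
    """Reject quotes and backslashes so values can be embedded without escaping."""
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in tag text: {text!r}")
    return text


def set_tags_statement(catalog: str, tags: dict) -> str:
    """Build an ALTER CATALOG ... SET TAGS statement from a key/value mapping."""
    pairs = ", ".join(
        f"'{_check(key)}' = '{_check(value)}'" for key, value in tags.items()
    )
    return f"ALTER CATALOG {catalog} SET TAGS ({pairs})"


def find_catalogs_by_tag_query(key: str, value: str) -> str:
    """Build a lookup query; the view and column names are assumptions to verify."""
    return (
        "SELECT catalog_name FROM system.information_schema.catalog_tags "
        f"WHERE tag_name = '{_check(key)}' AND tag_value = '{_check(value)}'"
    )


if __name__ == "__main__":
    # Hypothetical tags indicating data source and sensitivity level.
    print(set_tags_statement(
        "solar_lights_prod", {"source": "telemetry", "sensitivity": "internal"}
    ))
    print(find_catalogs_by_tag_query("sensitivity", "internal"))
```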
In this blog, we’ve discussed the 4 main considerations when setting up a catalog object and the best practices that go along with it. Try it out yourself by creating a catalog today!