Data Governance
In a new workspace without any data using the Unity Catalog, can I hide/delete the hive_metastore, main, samples, and system catalogs?

MetaRossiVinli
Contributor

I am setting up a new workspace that will use Unity Catalog. I want all data stored in Unity Catalog, in the following catalogs: dev, staging, and prod, and I want to prevent users from accidentally reading or writing data anywhere else.

For the above situation, can I hide and/or delete the following default catalogs?

  • hive_metastore
  • main
  • samples
  • system
1 ACCEPTED SOLUTION


Anonymous
Not applicable

@Kevin Rossi Unfortunately, hive_metastore can't be hidden as of now. It isn't needed for UC, but a Databricks workspace doesn't work well without the default RDS connections, and removing them would require changes to the way DBR/Spark starts up. Eventually there will be a UC-only workspace with no references to HMS, but that doesn't exist today (engineering is working on it).

Here are a couple of workarounds:

Change the default catalog from hive_metastore to another catalog using the "spark.databricks.sql.initial.catalog.name" Spark configuration property.
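For example, a minimal sketch of what this can look like, assuming the property is set in the cluster's Spark config before startup (the catalog name dev is just an illustration; spark is the SparkSession predefined in a Databricks notebook):

    # In the cluster's Spark config (Advanced options), add a line like:
    #   spark.databricks.sql.initial.catalog.name dev
    # Then, from a notebook attached to that cluster, verify where new sessions land:
    print(spark.conf.get("spark.databricks.sql.initial.catalog.name", "hive_metastore"))
    print(spark.sql("SELECT current_catalog()").first()[0])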

The default catalog can also be set while assigning the workspace to a metastore. If the workspace is already assigned, unassign it and reassign it with a default catalog.
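A rough sketch of doing the reassignment over the REST API is below. The endpoint path, field names, and all placeholder values are my assumptions about the Unity Catalog workspace-metastore assignment API, not something confirmed in this thread; verify against the current API docs before using.

    # Rough sketch: re-assign this workspace to the metastore with "dev" as its
    # default catalog. Endpoint path, field names, and placeholders are assumptions.
    import requests

    host = "https://<workspace-url>"        # placeholder
    token = "<personal-access-token>"       # placeholder
    workspace_id = "<workspace-id>"         # placeholder

    resp = requests.put(
        f"{host}/api/2.1/unity-catalog/workspaces/{workspace_id}/metastore",
        headers={"Authorization": f"Bearer {token}"},
        json={"metastore_id": "<metastore-id>", "default_catalog_name": "dev"},
    )
    resp.raise_for_status()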

The samples and system catalogs are read-only catalogs; they can't be removed.

Regarding the main catalog, we have a feature request called catalog-to-workspace binding. By default, a catalog is bound to all workspaces, but using this feature we can bind the catalog only to the desired workspaces. In that case, if we disable all workspace access to the main catalog, it won't be visible in any workspace. Please reach out to your Databricks contact to onboard your account to this feature.
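Once the binding feature is enabled for an account, the call might look roughly like the sketch below. The workspace-bindings endpoint and payload shape are assumptions on my part and may differ from what your Databricks contact enables; confirm against the current API docs.

    # Rough sketch: unbind the "main" catalog from workspaces where it should not appear.
    # Endpoint, payload shape, and placeholder IDs are assumptions, not confirmed here.
    import requests

    host = "https://<workspace-url>"          # placeholder
    token = "<personal-access-token>"         # placeholder
    unwanted_workspace_ids = [1234567890123]  # placeholder workspace IDs to unbind

    resp = requests.patch(
        f"{host}/api/2.1/unity-catalog/workspace-bindings/catalogs/main",
        headers={"Authorization": f"Bearer {token}"},
        json={"unassign_workspaces": unwanted_workspace_ids},
    )
    resp.raise_for_status()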


3 REPLIES

MetaRossiVinli
Contributor

Short answers that I derived from the above:

  • hive_metastore - This cannot be deleted or hidden, but the default catalog can be changed with the above instructions.
  • main - There is a new feature that can unbind catalogs from workspaces. This would remove access as I desire. TODO: request that our account be onboarded for this.
  • samples - Read only and cannot be removed.
  • system - Read only and cannot be removed.

OK, cool, thanks. I think that will enable me to effectively govern our users as desired. I am a fan of keeping everything cleanly separated. We are going to have two workspaces for our team:

  1. research for demos, dabbling, and testing new Databricks features
  2. prod for production code/notebooks that are vetted through Git PRs and use dev/staging/prod branches

Going forward, I would support features that enable data science teams to govern production pipelines in a clean manner. Removing unneeded databases/catalogs and improving pipeline management would be favorable in my opinion. I think everything that we need to implement this exists now.

Features like catalog-to-workspace binding help keep concerns separated, e.g. exposing a research catalog only to our research workspace and preventing access to that catalog in the prod workspace. This feature will prevent us from accidentally writing to the research catalog from a prod pipeline; we will also enforce this with permissions (see the sketch below)... but I like redundancy.
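For the permissions layer, a minimal sketch of that belt-and-braces grant/revoke, assuming a catalog named research and hypothetical groups researchers and prod-jobs (none of these names come from the thread; substitute your own principals):

    # Minimal sketch of the permissions layer, run from a notebook by the catalog owner.
    # The catalog "research" and the groups "researchers" and "prod-jobs" are hypothetical.
    spark.sql("GRANT USE CATALOG ON CATALOG research TO `researchers`")
    spark.sql("GRANT CREATE SCHEMA ON CATALOG research TO `researchers`")
    spark.sql("REVOKE ALL PRIVILEGES ON CATALOG research FROM `prod-jobs`")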

Avvar2022
New Contributor III

@Kevin Rossi @John Lourdu - I am also new to Databricks and am setting up an environment.

By default, "all users" have read access to the catalogs listed below.

My question: I see an option to revoke that read access. Do "all users" need read access to all of these catalogs? If I revoke it, will there be any impact? (A rough sketch of the revoke follows this list.)

  • main - "all users" have read access by default; I see an option to revoke it. Will revoking have any impact?
  • samples - "all users" have read access by default; I see an option to revoke it. Will revoking have any impact?
  • system - "all users" have read access by default; I see an option to revoke it. Will revoking have any impact?
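For reference, a rough sketch of what inspecting and revoking could look like from a notebook, assuming the grant shown in the UI is to the account users group (that principal name is my assumption); whether doing this on these catalogs is advisable is exactly the open question above:

    # Rough sketch only: inspect, then revoke, the default read grant on the main catalog.
    # Assumes the grant is held by the `account users` group; adjust to whatever principal
    # the UI actually shows. Run as a metastore admin or the catalog owner.
    display(spark.sql("SHOW GRANTS ON CATALOG main"))
    spark.sql("REVOKE ALL PRIVILEGES ON CATALOG main FROM `account users`")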
