Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
dgomezm
Databricks Employee

Throughout the dozens of engagements I’ve had since joining Databricks, I’ve found that customers often struggle to understand the scope and concept of Unity Catalog. Questions like “Does it store my data?”, “Is it safe?”, “Can I have multiple Unity Catalogs?”, and “Will it break anything?” are more common than I’d like. We present it as a one-size-fits-all solution (and it truly is), but getting started can feel intimidating due to its scope and capabilities.

For customers coming from legacy architectures, this shift is even more significant, as it turns their understanding of how things work on its head, introducing them to a whole new way to govern their data.

I’ve always found it useful to grasp the ‘why’ before diving into the ‘how’, so I’ll do my best to break down what Unity Catalog is (and isn’t).

A short history lesson. 


The hive_metastore originally emerged from the Hadoop and Hive ecosystem as a metadata repository for managing data objects and enabling efficient querying. It became the default metadata repository for the Databricks platform, managing data objects and permissions within each workspace. While this approach suited the needs of early Hadoop-based architectures, it began showing limitations in today’s cloud-native, multi-workspace setups. 

As implemented in Databricks, hive_metastore was designed as a workspace-level construct for managing metadata and permissions. This setup worked well for many, but if you had, say, a hundred workspaces sharing a single data source, you’d end up managing a hundred individual sets of permissions across those workspaces. (Imagine trying to juggle a hundred different keys for a single door—not exactly efficient or fun!) While some customers turned to external hive_metastores for added flexibility, the lack of centralized governance made scaling permissions more challenging within Databricks. 

workspace_diagram.png

And then, there was the matter of the two-level namespace restriction. This setup made managing permissions at scale more of a headache than needed. Picture this: a schema with a hundred tables beneath it, each serving different use cases. Your options? Either grant access to the entire schema, exposing all its tables, or meticulously assign access to each table one by one. Not exactly a recipe for scalable success.
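To make that trade-off concrete, here is a minimal Python sketch that only builds the GRANT statements as strings. The schema name `sales`, the table names, and the `analysts` group are all hypothetical, and the statements are simplified: real setups may need additional privileges, so treat this as an illustration rather than a ready-made script.

```python
# Hypothetical names throughout: "sales", "orders_NNN", and "analysts"
# are placeholders, not objects from any real workspace.

def schema_grant(schema: str, principal: str) -> list[str]:
    """Option 1: one statement, but it exposes every table in the schema."""
    return [f"GRANT SELECT ON SCHEMA {schema} TO `{principal}`"]


def table_grants(schema: str, tables: list[str], principal: str) -> list[str]:
    """Option 2: precise access, but one statement per table."""
    return [
        f"GRANT SELECT ON TABLE {schema}.{table} TO `{principal}`"
        for table in tables
    ]


# A schema with a hundred tables, of which the group needs only three.
all_tables = [f"orders_{i:03d}" for i in range(100)]
needed = all_tables[:3]

coarse = schema_grant("sales", "analysts")
fine = table_grants("sales", needed, "analysts")

print(len(coarse), "statement exposes", len(all_tables), "tables")
print(len(fine), "statements expose", len(needed), "tables")
```

With only two levels to work with, there is nothing between "everything in the schema" and "one grant per table", and the per-table list has to be kept up to date as tables come and go.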

access_permissions_diagram.png

Managing access to underlying storage brought its own set of challenges, especially with dbfs, the default storage location for hive_metastore. Access to external object storage was handled through instance profiles, which operated at the cluster level rather than the user level. This meant that all users on a given cluster shared the same access permissions, regardless of individual needs. Without user-specific permissions, enforcing precise security controls was difficult, resulting in a setup that lacked the flexibility needed for fine-tuned data governance.

And let’s not forget that hive_metastore was showing its age in other ways, too. It lacked key features like data lineage, access patterns, and data discovery — capabilities crucial for centralized governance solutions expected in today’s data-driven environments.


 

This is where Unity Catalog steps in. Unlike its predecessor, Unity Catalog is an account-level construct, allowing metadata and permissions to be shared across multiple workspaces. This shift enables centralized governance at scale by making data management more streamlined and efficient.

In short, while hive_metastore had its time in the spotlight, Unity Catalog represents a significant leap forward, offering the flexibility, security, and scalability that modern data environments require.

 

Metastore


metastore_diagram.png

Alright, now that we’ve got the background covered, let’s dive into the nuts and bolts of getting Unity Catalog up and running. If you’ve heard someone talking about “enabling UC” and felt a bit lost, don’t worry — you’re not alone. What they’re really talking about is creating a metastore.

So, what exactly is a metastore? Think of it as the central directory behind your data ecosystem—coordinating where everything lives and managing permissions across workspaces—but it doesn’t actually store your data.

Unlike traditional metastores, Unity Catalog adds a third level to the namespace: the catalog. This extra layer lets you organize data and scope permissions at a finer grain, making data governance at scale easier and more flexible. For example, having three catalogs, dev, stg, and prod, with identical schema and table structures allows easy testing without code changes, since only the catalog name differs between environments.
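As a rough sketch of the dev, stg, and prod idea, the catalog can be treated as a single parameter while the schema and table names stay fixed. The helper `table_path` and the names `sales` and `orders` are my own placeholders for illustration, not part of any Databricks API.

```python
# Assumption: three catalogs named dev, stg, and prod exist and share
# identical schema and table structures, as described above.
VALID_ENVS = ("dev", "stg", "prod")


def table_path(env: str, schema: str, table: str) -> str:
    """Build a three-level name: catalog.schema.table."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{env}.{schema}.{table}"


# The same code targets a different environment by changing one value.
for env in VALID_ENVS:
    print(table_path(env, "sales", "orders"))
```

In a notebook or job, the resulting string could be passed to something like `spark.table(...)`, with the environment supplied as a job parameter so the code itself never changes between environments.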

You’re probably asking yourself, “Will attaching a metastore to my workspace disable hive_metastore or break anything?” The answer is no. Unity Catalog and hive_metastore can co-exist without any issues. Your clusters can keep running in their legacy modes, your data will stay intact, and your job references won’t suddenly switch over. Think of it as adding a new tool to your toolbox—not a forced replacement. This approach makes it easier to transition gradually, giving you time to fully leverage Unity Catalog’s centralized capabilities without any immediate impact on your existing workflows. 
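One way to picture the co-existence: once a metastore is attached, the legacy metastore shows up as a catalog named `hive_metastore`, so existing tables remain reachable through a three-level name. The helper below is a hypothetical sketch of that naming rule only. It assumes two-level names resolve against `hive_metastore`, which depends on how the default catalog is configured in your workspace.

```python
def qualify(name: str, default_catalog: str = "hive_metastore") -> str:
    """Expand a legacy schema.table name into catalog.schema.table.

    Assumption: the default catalog is hive_metastore. Names that are
    already three-level are returned unchanged.
    """
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{default_catalog}.{name}"
    raise ValueError(f"expected schema.table or catalog.schema.table: {name!r}")


# A legacy reference and a Unity Catalog reference, side by side.
print(qualify("sales.orders"))
print(qualify("prod.sales.orders"))
```

Legacy references keep pointing where they always did, while new work can address Unity Catalog objects explicitly by catalog.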

Creating and configuring the metastore

 

Disclaimer: New workspaces are enabled on Unity Catalog by default. Find more here.

With that out of the way, we can move on to creating the metastore. There are two requirements: you must be an account administrator, and any workspace you plan to enable must be on the Premium plan or above.

In your account console, click Catalog, then select Create Metastore.

On this screen, you only need to provide a name and select the region where your workspace resides. Remember that Unity Catalog is tied to a cloud region and only stores metadata, your data’s blueprint.

There’s an option to add storage configurations to choose the default storage location for the metastore, but I’d highly recommend skipping this for now. With a metastore-level default in place, managed tables fall back to it whenever a catalog or schema has no storage location of its own, which can lead you to lose track of where your data is stored and make it harder to manage later on.

With the metastore in place, the next step is to attach it to your workspaces. Ignore the pop-up messages and march ahead like a fearless explorer — you’re not defusing a bomb here, so don’t worry, nothing’s going to blow up.
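Because a metastore is tied to a cloud region, it helps to check which workspaces can actually be attached to it before you start clicking. The sketch below models that rule in plain Python. The `Metastore` and `Workspace` classes, the workspace names, and the region strings are all hypothetical and do not call any Databricks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metastore:
    name: str
    region: str


@dataclass(frozen=True)
class Workspace:
    name: str
    region: str


def can_attach(workspace: Workspace, metastore: Metastore) -> bool:
    """A workspace can only use a metastore in its own region."""
    return workspace.region == metastore.region


def attachable(workspaces: list[Workspace], metastore: Metastore) -> list[str]:
    """Names of the workspaces that match the metastore's region."""
    return [w.name for w in workspaces if can_attach(w, metastore)]


# Hypothetical inventory: two workspaces share a region with the metastore.
metastore = Metastore(name="main-metastore", region="eu-west-1")
workspaces = [
    Workspace("analytics", "eu-west-1"),
    Workspace("ml-platform", "eu-west-1"),
    Workspace("us-reporting", "us-east-1"),
]

print(attachable(workspaces, metastore))
```

A workspace in another region would need a metastore created in that region.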

And just like that, your workspace is enabled on Unity Catalog! But wait — where does your data get stored? We’ll work out the details in the next post. Stay tuned!

Want to check out more posts?

Check out my blog here!