Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
eeezee
Databricks Employee

If you use Databricks to manage your data and haven't fully upgraded to Unity Catalog, you are likely still working with legacy datasets in the Hive Metastore. Unity Catalog and Delta Sharing make it easy to share data across workspaces, but sharing Hive Metastore data across workspaces requires an alternative approach: Databricks to Databricks Lakehouse Federation, currently in public preview.

In this article, we'll review the ways to share data across workspaces with Unity Catalog, Delta Sharing, and Lakehouse Federation.

Sharing data with Unity Catalog Metastore

For Databricks deployments with Unity Catalog enabled, a catalog can be shared with every workspace attached to the same metastore. In the following diagram, catalog Y is available in both workspace X and workspace Y. This is the default behavior unless a catalog has been assigned to specific workspaces, as catalog X has been.

Sharing across Workspaces with Unity Catalog

 

However, in this scenario the Hive Metastores of workspace X and workspace Y are not exposed to each other. Datasets registered in one workspace's Hive Metastore cannot be accessed from the other workspace.
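
The visibility rule described above can be sketched as a small toy model. This is not a Databricks API: the `Catalog` class, the `visible_catalogs` function, and all names used are hypothetical and exist only to illustrate the rule that a catalog is visible to every workspace on its metastore unless it has been assigned to specific workspaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Catalog:
    """Toy stand-in for a Unity Catalog catalog (hypothetical, not a real API)."""
    name: str
    metastore: str
    # An empty set means "not assigned to specific workspaces".
    bound_workspaces: frozenset = frozenset()


def visible_catalogs(workspace, workspace_metastore, catalogs):
    """Return the names of the catalogs that a workspace can see."""
    visible = []
    for catalog in catalogs:
        # A catalog is only reachable from workspaces on the same metastore.
        if catalog.metastore != workspace_metastore:
            continue
        # A catalog assigned to specific workspaces is hidden from the others.
        if catalog.bound_workspaces and workspace not in catalog.bound_workspaces:
            continue
        visible.append(catalog.name)
    return visible


# Mirrors the diagram: catalog X is assigned to workspace X, catalog Y is not assigned.
catalogs = [
    Catalog("catalog_x", "metastore_1", frozenset({"workspace_x"})),
    Catalog("catalog_y", "metastore_1"),
]

print(visible_catalogs("workspace_x", "metastore_1", catalogs))
print(visible_catalogs("workspace_y", "metastore_1", catalogs))
```

Each workspace's Hive Metastore is deliberately absent from the model, because it is local to its workspace and never appears in another workspace's list.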

Sharing with Delta Sharing

Delta Sharing is ideal for cross-region and cross-cloud data sharing, as well as sharing with other stakeholders outside the business. You can leverage Delta Sharing to share a number of assets, including datasets, with another workspace. With the built-in Delta Sharing capability, the producer workspace needs Unity Catalog enabled to share the assets, while the consumer workspace doesn't. However, even with both workspaces having Unity Catalog enabled, you are unable to see the Hive Metastore catalog when selecting assets to add to the share.

 

Sharing data across workspaces with Delta Sharing
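
As a rough sketch of the producer side, the helper below renders the Databricks SQL statements typically used to create a share, add tables to it, and grant it to a recipient. The function name `share_statements` and the share, recipient, and table names are hypothetical. The guard that rejects `hive_metastore` tables simply mirrors the limitation described above; it is an assumption about how you might fail early, not platform behavior.

```python
def share_statements(share, recipient, tables):
    """Render the SQL to publish Unity Catalog tables through Delta Sharing.

    `tables` must be three-level names (catalog.schema.table). Names are
    assumed to be trusted identifiers; no escaping is attempted here.
    """
    for table in tables:
        parts = table.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected catalog.schema.table, got {table!r}")
        if parts[0].lower() == "hive_metastore":
            # Hive Metastore tables cannot be selected when building a share.
            raise ValueError(f"{table!r} is in the Hive Metastore and cannot be shared")

    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    statements += [f"ALTER SHARE {share} ADD TABLE {table}" for table in tables]
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient}")
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


# Hypothetical names, for illustration only.
for statement in share_statements("sales_share", "partner_ws", ["main.sales.orders"]):
    print(statement + ";")
```

In a notebook on the producer workspace, each statement could be passed to `spark.sql(...)` by a user with the required privileges.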


How to share legacy datasets with Lakehouse Federation

Given the limitations of Unity Catalog and Delta Sharing, the question remains: how do you share legacy datasets? Previously, the options were more involved: upgrading to Unity Catalog, using third-party tools, or building a custom solution. All of these take time and effort. We recommend investing the time to upgrade to Unity Catalog, but if that isn't an option right now, Databricks to Databricks Lakehouse Federation offers a solution. Here's how it works:

  • Producer workspace
    Unlike the other methods, the data source (producer workspace) doesn't require Unity Catalog. This provides flexibility for workspaces with existing Hive Metastore setups. Note, however, that a cluster must be running on the producer side to accept the queries sent to it.
  • Consumer workspace
    The workspace receiving the data (consumer workspace) needs Unity Catalog enabled, because Lakehouse Federation is set up there.

Databricks to Databricks sharing with Lakehouse Federation

Setting up Databricks to Databricks Lakehouse Federation

First, create a personal access token on the producer workspace. The token should belong to a service account that has permission on the datasets you want to share. Depending on whether Unity Catalog is enabled on the producer workspace, these permissions are granted at the account level or the workspace level.
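
If you prefer to script this step, one option is the workspace Token API. The sketch below only builds the HTTP request and does not send it. It assumes the `POST /api/2.0/token/create` endpoint with `lifetime_seconds` and `comment` fields, which creates a token for whichever identity authenticates the call; creating a token for a service principal may require a different flow in your environment. The host name and credential shown are placeholders.

```python
import json
import urllib.request


def build_token_request(host, existing_credential, lifetime_seconds, comment):
    """Build (but do not send) a request that creates a personal access token.

    `existing_credential` authenticates the caller; the new token belongs to
    that same identity.
    """
    body = json.dumps(
        {"lifetime_seconds": lifetime_seconds, "comment": comment}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/token/create",
        data=body,
        headers={
            "Authorization": f"Bearer {existing_credential}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values, for illustration only.
request = build_token_request(
    host="producer-workspace.example.com",
    existing_credential="<existing-credential>",
    lifetime_seconds=90 * 24 * 60 * 60,  # 90 days
    comment="Lakehouse Federation connection",
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return the new token in the response body. Store it in a secret scope rather than in a notebook.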

Next, switch to the consumer workspace. Here you can set up Lakehouse Federation via Catalog Explorer. Select Databricks as the connection type and enter the details of the producer workspace and cluster:

Setting up Lakehouse Federation in the Catalog Explorer
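
The same connection can also be created with SQL instead of the UI. The sketch below renders a `CREATE CONNECTION` statement of type `databricks`, reading the token from a secret scope rather than embedding it. The function name, connection name, host, HTTP path, and secret scope and key are all hypothetical placeholders; the HTTP path is the one shown in the producer cluster's JDBC/ODBC settings.

```python
def create_connection_sql(name, host, http_path, secret_scope, secret_key):
    """Render the SQL for a Databricks-to-Databricks federation connection.

    Inputs are assumed to be trusted values; no escaping is attempted.
    """
    return (
        f"CREATE CONNECTION IF NOT EXISTS {name} TYPE databricks\n"
        "OPTIONS (\n"
        f"  host '{host}',\n"
        f"  httpPath '{http_path}',\n"
        # secret() keeps the producer's token out of the statement text.
        f"  personalAccessToken secret('{secret_scope}', '{secret_key}')\n"
        ")"
    )


# Placeholder values, for illustration only.
connection_sql = create_connection_sql(
    name="producer_ws",
    host="producer-workspace.example.com",
    http_path="sql/protocolv1/o/<workspace-id>/<cluster-id>",
    secret_scope="federation",
    secret_key="producer_pat",
)
print(connection_sql)
```

Run the resulting statement on the consumer workspace, for example with `spark.sql(connection_sql)`, as a user allowed to create connections.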

Finally, create a foreign catalog in the consumer workspace, using the connection you set up and referencing the producer's Hive Metastore.
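
In SQL, this step is a `CREATE FOREIGN CATALOG` statement whose `catalog` option names the remote catalog to mirror. The sketch below renders that statement; the function name and the catalog and connection names are hypothetical, and `hive_metastore` is assumed to be the name of the producer's legacy catalog.

```python
def create_foreign_catalog_sql(catalog, connection, remote_catalog="hive_metastore"):
    """Render the SQL that exposes a remote catalog through a connection.

    Inputs are assumed to be trusted identifiers; no escaping is attempted.
    """
    return (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog} "
        f"USING CONNECTION {connection} "
        f"OPTIONS (catalog '{remote_catalog}')"
    )


# Placeholder names, for illustration only.
foreign_catalog_sql = create_foreign_catalog_sql("legacy_hms", "producer_ws")
print(foreign_catalog_sql)
```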

Now you can directly access the schemas and datasets in the Hive Metastore of the connected workspace!
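
Once the foreign catalog exists, its tables are addressed with the usual three-level name: foreign catalog, then schema, then table. A minimal sketch of building such a query follows; the helper names and the catalog, schema, and table names are hypothetical.

```python
def three_level_name(catalog, schema, table):
    """Join the three name parts, quoting each with backticks."""
    parts = (catalog, schema, table)
    # A backtick inside a name is escaped by doubling it.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


def preview_sql(catalog, schema, table, limit=10):
    """Render a small preview query against a federated table."""
    return f"SELECT * FROM {three_level_name(catalog, schema, table)} LIMIT {int(limit)}"


# Placeholder names, for illustration only.
print(preview_sql("legacy_hms", "sales", "orders", limit=5))
```

Remember that the query is executed by the cluster on the producer workspace, so that cluster must be running.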

Conclusion

With this architecture, sharing legacy datasets becomes simple. Databricks Lakehouse Federation bridges the gap between your current Hive Metastore and the future of Unity Catalog.


1 Comment
mattiazeni
Databricks Employee

Great article! This solution really helps migrating legacy situations to Unity Catalog.