If you are using Databricks to manage your data and haven't fully upgraded to Unity Catalog, you are likely dealing with legacy datasets in the Hive Metastore. While Unity Catalog and Delta Sharing make it easy to share data across workspaces, sharing Hive Metastore data across workspaces requires an alternative approach: Databricks to Databricks Federation, currently in public preview.
In this article, we'll review the ways to share data across workspaces with Unity Catalog, Delta Sharing, and Lakehouse Federation.
Sharing data with Unity Catalog Metastore
For Databricks deployments with Unity Catalog enabled, catalogs in the same metastore can be shared across workspaces. In the following diagram, catalog Y is visible to both workspace X and workspace Y. This works as long as the catalog hasn't been bound to specific workspaces, as is the case with catalog X.
Sharing across Workspaces with Unity Catalog
However, in this scenario the Hive Metastores in workspaces X and Y are not exposed to each other: you cannot access their datasets across workspaces.
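As an aside, the workspace binding that isolates catalog X above can be managed programmatically as well as through Catalog Explorer. Here is a minimal sketch using the Databricks Python SDK; the catalog name and workspace ID are placeholders, and the exact SDK method surface may differ by version:

```python
# Sketch: bind catalog_x to a single workspace so it is no longer
# visible from other workspaces in the same metastore.
# Assumes the databricks-sdk package and a metastore admin identity;
# the catalog name and workspace ID below are placeholders.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads auth from env vars or ~/.databrickscfg

# Assign catalog_x to one workspace; workspaces not in the binding
# list lose access. The catalog's isolation mode must also be set
# to ISOLATED (e.g. via Catalog Explorer) for bindings to apply.
w.workspace_bindings.update(
    name="catalog_x",
    assign_workspaces=[1111111111111],
)
```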
Sharing with Delta Sharing
Delta Sharing is ideal for cross-region and cross-cloud data sharing, as well as for sharing with stakeholders outside the business. You can use Delta Sharing to share a range of assets, including datasets, with another workspace. With the built-in Delta Sharing capability, the producer workspace needs Unity Catalog enabled to share assets, while the consumer workspace doesn't. However, even when both workspaces have Unity Catalog enabled, the Hive Metastore catalog is not available when selecting assets to add to a share.
Sharing data across workspaces with Delta Sharing
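To make that limitation concrete, here is roughly what building a share looks like in SQL on the producer side. This is a sketch with placeholder share, catalog, schema, and table names, run from a notebook cell:

```python
# Sketch: create a share on a Unity Catalog-enabled producer
# workspace. All object names below are placeholders.
spark.sql("CREATE SHARE IF NOT EXISTS my_share")

# Unity Catalog tables can be added to the share...
spark.sql("ALTER SHARE my_share ADD TABLE catalog_y.sales.orders")

# ...but hive_metastore tables cannot: a statement like the one
# below is rejected, since only Unity Catalog objects are shareable.
# spark.sql("ALTER SHARE my_share ADD TABLE hive_metastore.sales.orders")
```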
How to share legacy datasets with Lakehouse Federation
Given the limitations of Unity Catalog and Delta Sharing, the question remains: how do you share legacy datasets? Previously, this meant more involved options such as upgrading to Unity Catalog, using third-party tools, or building a custom solution, all of which take time and effort. We recommend investing the time to upgrade to Unity Catalog, but if that isn't an option right now, Databricks to Databricks Lakehouse Federation offers a way forward. Here's how it works:
- Producer workspace
Unlike the other methods, the data source (producer workspace) doesn't require Unity Catalog, which gives flexibility to workspaces with existing Hive Metastore setups. Note, however, that a cluster must be running on the producer side to serve the queries sent to it.
- Consumer workspace
The workspace receiving the data (consumer workspace) needs Unity Catalog enabled, since that is where Lakehouse Federation is configured (a quick check is sketched after the diagram below).
Databricks to Databricks sharing with Lakehouse Federation
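Before starting the setup, you can confirm the consumer workspace is actually attached to a Unity Catalog metastore. A one-line sketch for a notebook cell:

```python
# Returns the ID of the Unity Catalog metastore the workspace is
# attached to; this errors out on a workspace without Unity Catalog.
print(spark.sql("SELECT current_metastore()").first()[0])
```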
Setting up Databricks to Databricks Lakehouse Federation
First, create a personal access token on the producer workspace, belonging to a service principal with permissions on the datasets. Depending on whether Unity Catalog is enabled, these permissions are managed at the account level or the workspace level.
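If you prefer to script this rather than use the UI, the Databricks Python SDK can mint the token. A minimal sketch, assuming the client is authenticated against the producer workspace as that service principal; the comment and lifetime are placeholder choices:

```python
# Sketch: create a personal access token on the producer workspace.
# Assumes the databricks-sdk package, authenticated as the service
# principal that has permissions on the datasets.
from databricks.sdk import WorkspaceClient

producer = WorkspaceClient()  # auth from env vars or ~/.databrickscfg

token = producer.tokens.create(
    comment="lakehouse-federation-consumer",  # placeholder label
    lifetime_seconds=90 * 24 * 3600,          # 90 days; rotate regularly
)

# token_value is only returned once, at creation time; store it in a
# secret scope on the consumer side rather than pasting it into code.
print(token.token_value)
```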
Next, switch to the consumer workspace. Here you can set up Lakehouse Federation via the Catalog Explorer. Select Databricks as the connection type and add the details of the producer workspace and cluster:
Setting up Lakehouse Federation in the Catalog Explorer
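The same connection can also be created in SQL instead of the UI. A sketch, run on the consumer workspace; the connection name, host, HTTP path, and secret scope/key are placeholders you substitute for your environment:

```python
# Sketch: create the Databricks-to-Databricks connection in SQL on
# the consumer workspace. host and httpPath identify the producer
# workspace and its running cluster; the token is read from a secret
# scope rather than inlined. All names below are placeholders.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS producer_hms_connection TYPE databricks
  OPTIONS (
    host 'adb-1111111111111.11.azuredatabricks.net',
    httpPath 'sql/protocolv1/o/1111111111111/0123-456789-abcdefgh',
    personalAccessToken secret('federation', 'producer-pat')
  )
""")
```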
Finally, you can create a foreign catalog in the consumer workspace using the connection you set up and referencing the Hive Metastore:
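In SQL this is a single statement. A sketch continuing with the placeholder connection name from the previous step; only the foreign catalog name is yours to choose, while 'hive_metastore' targets the producer's built-in Hive Metastore:

```python
# Sketch: create a foreign catalog over the producer's legacy
# Hive Metastore using the connection defined above. The catalog
# name producer_hms is a placeholder.
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS producer_hms
  USING CONNECTION producer_hms_connection
  OPTIONS (catalog 'hive_metastore')
""")
```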
Now you can directly access the schemas and datasets in the Hive Metastore of the connected workspace!
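Querying a legacy table then looks like any other three-level-namespace query; the schema and table names here are placeholders:

```python
# The producer's Hive Metastore tables are now addressable through
# the foreign catalog's three-level namespace on the consumer side.
spark.sql("SELECT * FROM producer_hms.sales.orders LIMIT 10").show()
```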
Conclusion
With this architecture, sharing legacy datasets becomes straightforward: Databricks to Databricks Lakehouse Federation bridges the gap between your current Hive Metastore and a future on Unity Catalog.