Are you using Hive Metastore on Databricks or an external Hive Metastore such as Glue? Do you want to migrate to Unity Catalog (UC), but need help figuring out where to start or what the migration process entails? If your answer is yes, then this article is for you. We will assist you in planning your UC migration and make the process less intimidating by providing you with all the necessary information.
At a high level, several topics need to be addressed as part of the migration journey:
Unity Catalog Metastore setup
Upgrade user management to account level
Understand the current landscape of your Glue or Hive Metastore
Design Unity Catalog architecture
Migration to UC
Cut over your jobs, dashboards, and queries to point to UC
Please refer to the execution/project plan for detailed steps and links to Databricks documentation. The plan guides you through the migration process with timelines to keep your project on track. Let's review each topic in detail.
1. Unity Catalog Metastore Setup:
To enable Unity Catalog in Databricks, you need to set up the Unity Catalog Metastore along with objects such as Storage Credentials and External Locations. This involves creating AWS resources, including IAM roles and policies, as well as the corresponding storage credentials and external locations within Databricks.
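As a rough illustration of the Databricks side of this setup, the sketch below builds a CREATE EXTERNAL LOCATION statement that ties an S3 path to an existing storage credential. The location name, bucket path, and credential name are hypothetical placeholders, and the helper only produces the SQL text; you would still run the statement yourself in a UC-enabled workspace.

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for one storage path."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


# Hypothetical names: replace them with your own location, bucket, and credential.
statement = external_location_sql(
    "sales_landing", "s3://example-bucket/sales", "example_credential"
)
print(statement)
```

The storage credential itself must already exist and wrap an IAM role that can access the bucket.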
2. Upgrade User Management to Account Level:
If you have set up SCIM provisioning at the workspace level using Okta, Entra ID (Azure AD), or another identity provider, we recommend moving this setup to the account level. You can then manage identities once for the whole account instead of configuring provisioning for each workspace you create in Databricks. We also recommend enabling SSO at the account level and SSO federation to the workspace (unified login) for a streamlined experience. If you have created groups within Databricks that are not managed by the identity provider, for example through Terraform or as ad hoc groups in a workspace, special steps are required. Please refer to the project plan for the specific documentation that will help you accomplish this.
3. Understand the Current Hive Landscape:
To determine the complexity of the migration, review the complete list of tables and views in your current Hive Metastore. Beyond the list of databases and tables, you also need the following details:
The file formats (Delta, Parquet, CSV, Hive SerDe, etc.)
Whether tables are managed or external
S3/ADLS paths for external tables
For managed tables, whether the data is stored on DBFS
Having this information will help you select the appropriate migration pattern for most of the tables. If the file format is Delta or Parquet, you can easily migrate using the SYNC command. For other file formats, you may need to use CTAS (CREATE TABLE AS SELECT) into UC, or first convert the tables to Delta and then use SYNC. Refer to the plan for more details.
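The format-based rule above can be sketched as a small inventory classifier. The HiveTable record, the pattern labels, and the sample table names are all hypothetical, and the rule only encodes the Delta/Parquet versus other-formats split described here; managed tables on DBFS and other special cases would still need a manual check against the plan.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HiveTable:
    """One row of a (hypothetical) Hive Metastore inventory."""
    name: str
    file_format: str
    is_managed: bool
    location: Optional[str] = None  # S3/ADLS path for external tables


SYNC_FORMATS = {"delta", "parquet"}


def migration_pattern(table: HiveTable) -> str:
    """Pick a pattern from the file format alone."""
    if table.file_format.lower() in SYNC_FORMATS:
        return "SYNC"
    return "CTAS_OR_CONVERT"


def group_by_pattern(tables: List[HiveTable]) -> Dict[str, List[str]]:
    """Group table names by the migration pattern they fall under."""
    groups = defaultdict(list)
    for table in tables:
        groups[migration_pattern(table)].append(table.name)
    return dict(groups)


inventory = [
    HiveTable("sales.orders", "delta", False, "s3://example-bucket/orders"),
    HiveTable("sales.raw_events", "csv", False, "s3://example-bucket/raw_events"),
    HiveTable("sales.customers", "PARQUET", True),
]
plan = group_by_pattern(inventory)
print(plan)
```

Grouping the inventory this way gives a quick estimate of how much of the migration SYNC can cover and how many tables need individual attention.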
4. Design Unity Catalog Architecture:
Defining the segregation of data assets is a crucial phase in the design of UC. During this phase, you will decide how to structure your catalogs and databases (schemas) based on teams, environments (prod, staging, dev), or any other logical approach. You should evaluate the scalability of this data segregation/cataloging design over time and how you plan to manage user permissions (granularity-wise). To design your data landscape in UC, follow the best practices outlined in the Databricks documentation.
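To make one possible layout concrete, the sketch below assumes a catalog per environment and a schema per team, and generates the matching CREATE CATALOG and CREATE SCHEMA statements. The environment and team names are hypothetical, and this is only one of several reasonable ways to segregate data assets.

```python
from itertools import product
from typing import List


def layout_statements(environments: List[str], teams: List[str]) -> List[str]:
    """Generate DDL for one catalog per environment and one schema per team."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {env}" for env in environments]
    statements += [
        f"CREATE SCHEMA IF NOT EXISTS {env}.{team}"
        for env, team in product(environments, teams)
    ]
    return statements


# Hypothetical environments and teams.
ddl = layout_statements(["dev", "prod"], ["finance", "marketing"])
for line in ddl:
    print(line)
```

Listing the statements up front makes it easier to see how the number of catalogs and schemas grows as teams or environments are added.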
5. Migration to Unity Catalog:
The first step is to set up the catalogs and schemas in Unity Catalog, and then migrate the tables to UC. You can do this with the SYNC command for Delta/Parquet file formats, or with other mechanisms for the remaining file formats. The plan describes the different migration mechanisms. Once the migration is complete, the next important step is to grant permissions to users, groups, and service principals on UC tables, external locations, and other assets/objects to prepare for the next stage.
6. Cut Over Your Jobs, Dashboards, and Queries:
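The migrate-then-grant sequence can be sketched as generated SQL. The helpers below build a SYNC SCHEMA statement (with DRY RUN as a preview step) and a GRANT SELECT statement; the catalog, schema, table, and group names are hypothetical, and the code only produces statement text rather than running anything against a workspace.

```python
HMS_CATALOG = "hive_metastore"


def sync_schema_sql(target_catalog: str, schema: str, dry_run: bool = True) -> str:
    """Build a SYNC SCHEMA statement from an HMS schema into a UC catalog."""
    stmt = f"SYNC SCHEMA {target_catalog}.{schema} FROM {HMS_CATALOG}.{schema}"
    if dry_run:
        stmt += " DRY RUN"  # preview the result without upgrading anything
    return stmt


def grant_select_sql(table: str, principal: str) -> str:
    """Build a GRANT SELECT statement for a user, group, or service principal."""
    return f"GRANT SELECT ON TABLE {table} TO `{principal}`"


# Hypothetical catalog, schema, table, and group names.
preview = sync_schema_sql("prod", "sales")
apply = sync_schema_sql("prod", "sales", dry_run=False)
grant = grant_select_sql("prod.sales.orders", "analysts")
print(preview, apply, grant, sep="\n")
```

Running the DRY RUN variant first lets you review which tables would be upgraded before applying the change.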
Migrating tables to UC is only half the job. You also need to plan for upgrading the code (queries, jobs, dashboards, etc.) that still points to the HMS tables. Luckily, the SYNC command gives you breathing room to repoint your code to UC assets over time, because re-running it keeps the UC tables in sync with their HMS counterparts. You can approach this based on how your org and teams are organized. Two approaches that we have seen succeed:
Pipeline by pipeline to ensure your entire pipeline is running on UC tables end-to-end. This approach helps you apply any learnings on dos and don'ts to the next pipeline.
Data layer by layer (Gold, Silver, and Bronze): You can approach these layers in either direction, but we found that targeting the Gold layer first was the least disruptive. If something goes wrong when repointing the Bronze layer to UC first, the consequences are amplified across the many downstream Silver and Gold tables that depend on it. Also, SYNC only syncs from HMS to UC. If upstream jobs make schema changes to Bronze UC tables, consumers still reading from the HMS Bronze tables will not receive those changes until they switch to the UC tables, which makes the cutover time-sensitive. Therefore, we recommend the following order:
Target all your gold table consumers first. This includes dashboards, BI tools, ad-hoc queries, views, etc.
Then update the code for the producers of Gold tables. These can be your dbt jobs or the queries/processes that aggregate and transform data from the Silver to the Gold layer.
Next, update the producers of your Silver tables.
Finally, update the code of your external data producers that write directly to these HMS tables so that they point to UC tables.
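For the code changes themselves, a simple starting point is a scripted rewrite of fully qualified HMS references. The sketch below assumes queries use three-level names of the form hive_metastore.schema.table and that schema and table names carry over unchanged into a single target catalog (here the hypothetical catalog prod). Two-level names, backtick-quoted identifiers, and dynamically built SQL are not handled and would need a separate review.

```python
import re

# Matches fully qualified references such as hive_metastore.sales.orders.
HMS_REF = re.compile(r"\bhive_metastore\.(\w+)\.(\w+)\b")


def repoint_to_uc(sql: str, target_catalog: str) -> str:
    """Replace hive_metastore.<schema>.<table> with <catalog>.<schema>.<table>."""
    return HMS_REF.sub(
        lambda m: f"{target_catalog}.{m.group(1)}.{m.group(2)}", sql
    )


query = (
    "SELECT * FROM hive_metastore.sales.orders o "
    "JOIN hive_metastore.sales.customers c ON o.cid = c.id"
)
rewritten = repoint_to_uc(query, "prod")
print(rewritten)
```

Applying the rewrite consumer by consumer, in the order listed above, keeps each change small enough to verify before moving to the next layer.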
We recommend referring to Databricks' official documentation for detailed information on each topic related to migrating from Hive Metastore to Unity Catalog in Databricks.