Authors: Dhaval Bagadia, Ziyuan Qin
Are you using Hive Metastore on Databricks or an external Hive Metastore such as Glue? Do you want to migrate to Unity Catalog (UC), but need help figuring out where to start or what the migration process entails? If your answer is yes, then this article is for you. We will assist you in planning your UC migration and make the process less intimidating by providing you with all the necessary information.
At a high level, several topics need to be addressed as part of the migration journey:
Please refer to the execution/project plan (CLICK HERE) for detailed steps and links to Databricks documentation. The plan will guide you through the migration process with timelines to keep your project on track. Let's review each topic in detail to ensure a clear understanding.
1. Unity Catalog Metastore Setup:
To enable Unity Catalog in Databricks, you first need to set up the Unity Catalog metastore along with supporting objects such as storage credentials and external locations. On AWS, this involves creating the cloud resources, including IAM roles and policies, and then creating the corresponding storage credentials and external locations within Databricks.
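As a rough illustration of the last part of that setup, the sketch below builds a `CREATE EXTERNAL LOCATION` statement from Python. It assumes the storage credential already exists in Databricks; the location name, bucket URL, and credential name are hypothetical placeholders, and the helper function is ours, not part of any Databricks library.

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for Databricks SQL."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


# Hypothetical names: replace with your own location, bucket, and credential.
stmt = external_location_sql(
    name="sales_raw",
    url="s3://example-bucket/sales/raw",
    credential="uc_migration_cred",
)
print(stmt)
```

You could paste the printed statement into a SQL editor, or pass it to `spark.sql(...)` from a notebook attached to UC-enabled compute.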
2. Upgrade User Management to Account Level:
If you have set up SCIM provisioning at the workspace level using Okta, Entra ID (Azure AD), or another identity provider, we recommend moving this setup to the account level. You can then manage users and groups once, rather than repeating the setup for each workspace you create in Databricks. We also recommend enabling SSO at the account level and SSO federation to the workspace (unified login) for a streamlined experience. If you have created groups within Databricks that are not managed by the identity provider, for example by using Terraform or by creating ad hoc groups within the Databricks workspace, special steps are required. Please refer to the project plan for specific documentation that will help you accomplish this.
3. Understand the Current Hive Landscape:
In order to determine the complexity of the migration process, it is important to review the complete list of tables and views in your current Hive Metastore. This not only involves understanding the list of databases and tables you have but also the following details:
Having this information will help you select the appropriate migration pattern for most of the tables. If the file format is Delta or Parquet, you can easily migrate using the SYNC command. For other file formats, you may need to use CTAS (CREATE TABLE AS SELECT) to create the table in UC, or first convert the table to Delta and then use SYNC. Refer to the plan for more details.
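The decision rule above can be sketched as a small Python helper that groups an inventory by migration pattern. This is only an illustration: the inventory structure, the table names, and the pattern labels are hypothetical, and in practice you would build the inventory from your own Hive Metastore.

```python
from collections import defaultdict

# Formats the article treats as candidates for the SYNC command.
SYNC_FORMATS = {"delta", "parquet"}


def choose_migration_pattern(file_format: str) -> str:
    """Return a pattern label for a table based on its file format."""
    fmt = file_format.strip().lower()
    return "SYNC" if fmt in SYNC_FORMATS else "CTAS_OR_CONVERT_THEN_SYNC"


def group_by_pattern(inventory: list[dict]) -> dict[str, list[str]]:
    """Group table names by the migration pattern chosen for each."""
    groups = defaultdict(list)
    for table in inventory:
        groups[choose_migration_pattern(table["format"])].append(table["name"])
    return dict(groups)


# Hypothetical inventory entries for illustration only.
inventory = [
    {"name": "sales.orders", "format": "DELTA"},
    {"name": "sales.customers", "format": "parquet"},
    {"name": "logs.raw_events", "format": "CSV"},
]

plan = group_by_pattern(inventory)
for pattern, tables in plan.items():
    print(pattern, tables)
```

Grouping up front gives you a quick estimate of how much of the migration is a straightforward SYNC and how much needs extra work.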
4. Design Unity Catalog Architecture:
Defining how data assets are segregated is a crucial phase in the design of UC. During this phase, you will decide how to structure your catalogs and schemas (databases) based on teams, environments (prod, staging, dev), or any other logical approach. You should evaluate how well this cataloging design will scale over time and at what level of granularity you plan to manage user permissions. To design your data landscape in UC, follow the best practices outlined in the Databricks documentation.
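To make one such layout concrete, here is a minimal sketch that turns an environment-based design into `CREATE CATALOG` and `CREATE SCHEMA` statements. The one-catalog-per-environment layout and the schema names are assumptions chosen for illustration, not a recommendation for your organization.

```python
def layout_ddl(layout: dict[str, list[str]]) -> list[str]:
    """Generate CREATE CATALOG / CREATE SCHEMA statements for a layout.

    `layout` maps a catalog name to the schemas it should contain.
    """
    statements = []
    for catalog, schemas in layout.items():
        statements.append(f"CREATE CATALOG IF NOT EXISTS `{catalog}`")
        for schema in schemas:
            statements.append(
                f"CREATE SCHEMA IF NOT EXISTS `{catalog}`.`{schema}`"
            )
    return statements


# Hypothetical design: one catalog per environment, one schema per team.
layout = {
    "prod": ["sales", "finance"],
    "staging": ["sales", "finance"],
    "dev": ["sales", "finance"],
}

ddl = layout_ddl(layout)
for statement in ddl:
    print(statement)
```

Keeping the layout in a single data structure makes it easy to review the design with stakeholders before any object is created.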
5. Migration to Unity Catalog:
The first step is to set up the catalogs and schemas in Unity Catalog, and then migrate the tables to UC. You can accomplish this with the SYNC command for Delta/Parquet file formats, or with other mechanisms for the remaining file formats. The plan describes the different migration mechanisms. Once the migration is complete, the next important step is to grant permissions to users, groups, and service principals on UC tables, external locations, and other objects to prepare for the next stage.
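The migration and permission steps can be sketched as statement builders like the ones below. This is a hedged example: it assumes the UC schema keeps the same name as the Hive schema, the catalog, schema, table, and group names are hypothetical, and the read-only grant set is just one possible starting point.

```python
def sync_schema_sql(catalog: str, schema: str, dry_run: bool = True) -> str:
    """Build a SYNC SCHEMA statement; DRY RUN only reports what would happen."""
    stmt = f"SYNC SCHEMA `{catalog}`.`{schema}` FROM hive_metastore.`{schema}`"
    return (stmt + " DRY RUN") if dry_run else stmt


def ctas_sql(catalog: str, schema: str, table: str) -> str:
    """Build a CTAS statement that copies a Hive table into Unity Catalog."""
    target = f"`{catalog}`.`{schema}`.`{table}`"
    source = f"hive_metastore.`{schema}`.`{table}`"
    return f"CREATE TABLE {target} AS SELECT * FROM {source}"


def read_grants_sql(catalog: str, schema: str, principal: str) -> list[str]:
    """Build the grants a principal needs to read every table in a schema."""
    return [
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA `{catalog}`.`{schema}` TO `{principal}`",
        f"GRANT SELECT ON SCHEMA `{catalog}`.`{schema}` TO `{principal}`",
    ]


# Hypothetical names for illustration only.
statements = [
    sync_schema_sql("prod", "sales"),            # preview first
    sync_schema_sql("prod", "sales", dry_run=False),
    ctas_sql("prod", "logs", "raw_events"),      # non-Delta/Parquet table
    *read_grants_sql("prod", "sales", "data-analysts"),
]
for statement in statements:
    print(statement)
```

Running the `DRY RUN` variant first lets you check which tables would be upgraded before anything changes in Unity Catalog.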
6. Cutover Your Jobs, Dashboards, and Queries:
Migrating tables to UC is only half the job. It is important to plan for upgrading your code (queries, jobs, dashboards, etc.) that still points to the HMS tables. Luckily, the SYNC command gives you enough breathing room to update your code to point to UC assets over time: the HMS tables stay in place, and re-running SYNC keeps the corresponding tables in UC aligned with them. You can approach this based on how your org/teams are organized. Two approaches that we have seen success with:
OR
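Whichever approach you choose, much of the code change comes down to replacing HMS table names with three-level UC names. The sketch below shows one simple way to do that with a mapping and regular expressions. It is an assumption-laden illustration: the mapping and query are hypothetical, and the function does not parse SQL, so names inside string literals or comments would be rewritten too.

```python
import re


def qualify_table_refs(sql: str, mapping: dict[str, str]) -> str:
    """Replace HMS table names in `sql` with their UC equivalents.

    `mapping` maps an HMS name (e.g. "sales.orders") to a UC name
    (e.g. "prod.sales.orders").
    """
    for hms_name, uc_name in mapping.items():
        # Match the name only when it is not part of a longer dotted
        # identifier, so an already-qualified name is not rewritten twice.
        pattern = re.compile(rf"(?<![\w.]){re.escape(hms_name)}(?![\w.])")
        sql = pattern.sub(uc_name, sql)
    return sql


# Hypothetical mapping and query for illustration only.
mapping = {
    "sales.orders": "prod.sales.orders",
    "hive_metastore.sales.customers": "prod.sales.customers",
}
query = (
    "SELECT * FROM sales.orders o "
    "JOIN hive_metastore.sales.customers c ON o.cid = c.id"
)

migrated = qualify_table_refs(query, mapping)
print(migrated)
```

Treat the output as a draft for review rather than a finished migration, since dashboards and jobs often build table names dynamically.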
We recommend referring to Databricks' official documentation for detailed information on each topic related to migrating from Hive Metastore to Unity Catalog in Databricks.