Databricks Community

Dhaval_Bagadia · ‎11-10-2023

Are you using Hive Metastore on Databricks or an external Hive Metastore such as Glue? Do you want to migrate to Unity Catalog (UC), but need help figuring out where to start or what the migration process entails? If your answer is yes, then this article is for you. We will assist you in planning your UC migration and make the process less intimidating by providing you with all the necessary information.

At a high level, several topics need to be addressed as part of the migration journey:

Unity Catalog Metastore setup
Upgrade user management to account level
Understand the current landscape of your Glue or Hive Metastore
Design Unity Catalog architecture
Migration to UC
Cut over your jobs, dashboards, and queries to point to UC

Please refer to the execution/project plan (CLICK HERE) for detailed steps and links to Databricks documentation. The plan will guide you through the migration process with timelines to keep your project on track. Let's review each topic in detail to ensure a clear understanding.

1. Unity Catalog Metastore Setup:

To enable Unity Catalog in Databricks, you need to set up the Unity Catalog Metastore along with supporting objects such as Storage Credentials and External Locations. This involves creating AWS resources (IAM roles and policies) as well as the corresponding storage credentials and external locations within Databricks.
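As a rough illustration of the Databricks-side half of this setup, the sketch below builds a CREATE EXTERNAL LOCATION statement as a string. The location name, bucket path, and credential name are hypothetical placeholders, and the sketch assumes the storage credential backed by your IAM role already exists. Treat it as a starting point to check against the documentation rather than a finished setup script.

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for an existing storage credential."""
    # Backticks guard identifiers that contain characters such as hyphens.
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


# Hypothetical names, for illustration only.
print(external_location_sql("sales_landing", "s3://example-bucket/sales", "uc_sales_role"))
```

The generated statement would be run in a UC-enabled workspace by a user with the privileges needed to create external locations.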

2. Upgrade User Management to Account Level:

If you have set up your SCIM provisioning at the workspace level using Okta, Entra ID (Azure AD), or other identity providers, it is recommended that you move this setup to the Account level. This will allow you to manage it at the Account level, rather than having to set it up for each workspace you create in Databricks. In addition, it is recommended that you enable SSO at the account level and SSO federation to the workspace (unified login) for a streamlined experience. If you have created groups within Databricks (not managed by the Identity provider), such as by using Terraform or creating ad hoc groups within the Databricks workspace, special steps are required. Please refer to the project plan for specific documentation that will help you accomplish this.

3. Understand the Current Hive Landscape:

In order to determine the complexity of the migration process, it is important to review the complete list of tables and views in your current Hive Metastore. This not only involves understanding the list of databases and tables you have but also the following details:

The file formats (Delta, Parquet, CSV, Hive SerDe, etc.)
Whether tables are managed or external
S3/ADLS paths for external tables
For managed tables, whether the data is stored on DBFS

Having this information will help you select the appropriate migration pattern for most of the tables. If the file format is Delta/Parquet, you can easily migrate using the SYNC command. For other file formats, you may need to use CTAS (CREATE TABLE AS SELECT) into UC, or first convert the tables to Delta and then use SYNC. Refer to the plan for more details.
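The rules of thumb above can be sketched as a small classifier over your table inventory. The HmsTable fields, the pattern labels, and the sample tables are hypothetical, and the DBFS check is a simplification (it treats any dbfs:/ path outside dbfs:/mnt/ as DBFS root); adjust it to however your inventory reports locations.

```python
from dataclasses import dataclass


@dataclass
class HmsTable:
    name: str          # e.g. "sales.orders"
    file_format: str   # e.g. "delta", "parquet", "csv"
    is_managed: bool
    location: str      # storage path reported by the metastore


def migration_pattern(table: HmsTable) -> str:
    """Pick a coarse migration pattern following the rules of thumb in this article."""
    loc = table.location.lower()
    # Simplifying assumption: dbfs:/mnt/... is mounted cloud storage, not DBFS root.
    on_dbfs_root = loc.startswith("dbfs:/") and not loc.startswith("dbfs:/mnt/")
    if table.is_managed and on_dbfs_root:
        # Data on DBFS root has to be copied into UC.
        return "CTAS"
    if table.file_format.lower() in {"delta", "parquet"}:
        return "SYNC"
    # Other formats: copy with CTAS, or convert to Delta first and then SYNC.
    return "CTAS_OR_CONVERT_THEN_SYNC"


# Hypothetical inventory, for illustration only.
inventory = [
    HmsTable("sales.orders", "delta", False, "s3://example-bucket/sales/orders"),
    HmsTable("sales.raw_events", "csv", False, "s3://example-bucket/sales/raw_events"),
    HmsTable("tmp.scratch", "delta", True, "dbfs:/user/hive/warehouse/tmp.db/scratch"),
]
for t in inventory:
    print(t.name, "->", migration_pattern(t))
```

Grouping the inventory by the returned label gives a first estimate of how much of the estate can go through SYNC and how much needs a data copy.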

4. Design Unity Catalog Architecture:

Defining the segregation of data assets is a crucial phase in the design of UC. During this phase, you will decide how to structure your catalogs and databases (schemas) based on teams, environments (prod, staging, dev), or any other logical approach. You should evaluate the scalability of this data segregation/cataloging design over time and how you plan to manage user permissions (granularity-wise). To design your data landscape in UC, follow the best practices outlined in the Databricks documentation.
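To make one possible layout concrete, the sketch below generates CREATE CATALOG and CREATE SCHEMA statements for a catalog per environment and team. The naming scheme (environment_team) and the example names are assumptions for illustration, not a recommendation from the Databricks documentation; a different segregation may suit your organization better.

```python
from itertools import product


def catalog_ddl(environments, teams, schemas):
    """Yield CREATE statements for one catalog per (environment, team) pair."""
    for env, team in product(environments, teams):
        catalog = f"{env}_{team}"  # hypothetical naming convention
        yield f"CREATE CATALOG IF NOT EXISTS {catalog}"
        for schema in schemas:
            yield f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"


# Hypothetical environments, teams, and schemas.
for statement in catalog_ddl(["dev", "prod"], ["sales"], ["bronze", "silver", "gold"]):
    print(statement)
```

Generating the statements from a short list like this makes it easy to see how the number of catalogs and schemas grows as teams or environments are added.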

5. Migration to Unity Catalog:

The first step is to set up the catalogs and schemas in Unity Catalog, and then migrate the tables to UC. You can accomplish this through the SYNC command for Delta/Parquet file formats, or through other mechanisms for the remaining file formats. Refer to the plan to review the different migration mechanisms. Once the migration is complete, the next important step is to grant permissions to users, groups, and service principals on UC tables, external locations, and other assets/objects to prepare for the next stage.
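A minimal sketch of the two steps described above, building the SQL as strings: a SYNC SCHEMA statement (with DRY RUN by default, so you can inspect what would be upgraded before committing) and the grants a group needs to read one table. The catalog, schema, table, and group names are hypothetical, and the sketch assumes the source schema keeps its name in the target catalog.

```python
def sync_schema_sql(catalog: str, schema: str, dry_run: bool = True) -> str:
    """Build a SYNC statement that upgrades an HMS schema into a UC catalog."""
    statement = f"SYNC SCHEMA {catalog}.{schema} FROM hive_metastore.{schema}"
    if dry_run:
        # DRY RUN reports what would happen without changing anything.
        statement += " DRY RUN"
    return statement


def grant_select_sql(catalog: str, schema: str, table: str, principal: str) -> list:
    """Build the grants a principal needs to read a single UC table."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical names, for illustration only.
print(sync_schema_sql("prod_sales", "gold"))
for statement in grant_select_sql("prod_sales", "gold", "orders", "sales-analysts"):
    print(statement)
```

Reading a table requires the privileges on the parent catalog and schema as well as on the table itself, which is why the helper emits three statements.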

6. Cut Over Your Jobs, Dashboards, and Queries:

Migrating tables to UC is only half the job. It is important to plan for upgrading the code (queries, jobs, dashboards, etc.) that still points to the HMS tables. Luckily, the SYNC command gives you breathing room to update your code to point to UC assets over time, because re-running it keeps the UC tables in sync with their HMS counterparts. You can approach this based on how your org/teams are organized. Two approaches that we have seen succeed:

Pipeline by pipeline to ensure your entire pipeline is running on UC tables end-to-end. This approach helps you apply any learnings on dos and don'ts to the next pipeline.

OR

Data layer by layer (Gold, Silver, and Bronze): You can approach these layers in either direction, but we found that targeting the Gold layer first made the process least disruptive. If something goes wrong when repointing the Bronze layer to UC first, the consequences are amplified across the multiple downstream Silver and Gold tables that depend on it. Also, SYNC only syncs from HMS to UC. If upstream jobs make schema changes to Bronze UC tables, consumers still reading from the HMS Bronze tables will not receive those changes until they switch to the UC tables, which makes the cutover time-sensitive. Therefore, we recommend the following:

Target all your Gold table consumers first. This includes dashboards, BI tools, ad-hoc queries, views, etc.
Then update the code for the producers of Gold tables. These can be your dbt jobs or queries/processes that aggregate and transform data from the Silver to the Gold layer.
Next, update the producers of your Silver tables.
Update the code of your external data producers that write directly to these HMS tables to point to UC tables.
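The ordering above can be sketched as a simple sort over an asset inventory. The Asset type, the kind labels, and the sample assets are hypothetical; the only thing taken from the recommendation is the order of the four groups.

```python
from dataclasses import dataclass

# Lower rank is cut over earlier, mirroring the four steps listed above.
CUTOVER_RANK = {
    "gold_consumer": 0,      # dashboards, BI tools, ad-hoc queries, views
    "gold_producer": 1,      # jobs that build Gold from Silver
    "silver_producer": 2,    # jobs that build Silver tables
    "external_producer": 3,  # external writers that land data in HMS tables
}


@dataclass
class Asset:
    name: str
    kind: str  # one of the CUTOVER_RANK keys


def cutover_order(assets):
    """Return assets in the recommended cutover order (stable within a group)."""
    return sorted(assets, key=lambda asset: CUTOVER_RANK[asset.kind])


# Hypothetical inventory, for illustration only.
assets = [
    Asset("ingest_orders", "external_producer"),
    Asset("build_silver_orders", "silver_producer"),
    Asset("revenue_dashboard", "gold_consumer"),
    Asset("build_gold_revenue", "gold_producer"),
]
for asset in cutover_order(assets):
    print(asset.kind, asset.name)
```

Because the sort is stable, assets within the same group keep their inventory order, so you can pre-sort each group by team or priority before applying it.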

We recommend referring to Databricks' official documentation for detailed information on each topic related to migrating from Hive Metastore to Unity Catalog in Databricks.

Cypher · ‎06-11-2024

Great write-up! What I'd like to know is why Databricks made it so difficult to migrate to Unity Catalog (UC). It would be logical for the company to either create a robust migration tool or offer the capability to integrate a catalog with the Hive Metastore.

Dhaval_Bagadia · ‎06-11-2024

We could have done a better job upfront to make it easy, and we are filling in that gap as we speak and getting it right.

Recently, a tool was set up (and is regularly updated with more capabilities) that helps automate migration to UC to a good extent. You can check it out: https://github.com/databrickslabs/ucx

Cypher · ‎06-11-2024

We used the UCX tool in January 2024, and it was more of an assessment tool. Even today, it's not a tool that one can "unleash" in their environment to perform a migration. For one, UCX still doesn't have notebook migrations. If you are using RDD, DBFS, or APIs in notebooks, you're pretty much out of luck with easy UC migrations.

Other issues:

Migration is a cost: both people and data movement costs (AWS + Databricks' multipurpose clusters).
Testing is a cost: people, plus one has to run two parallel environments (AWS + Databricks costs) to ensure that everything works ... or take a tremendous risk plunging their entire environment into UC.
Three-part names: this can be mitigated by using a default catalog, but that requires testing as well ... or forcing all your customers to rewrite their queries.

Dhaval_Bagadia · ‎06-12-2024

With managed tables on DBFS root and a few other cases, the UC migration is not straightforward.

Do you mainly have managed tables on DBFS root? If yes, I agree with the time and complexity concern and don't really have an easy answer there except to recreate the table with its data in the UC catalog.

But if you use external tables, it gets quite simple, and also, you don't really have to set up two environments as you can try it out first in your dev/QA and do in-place updates to code in production gradually over time.

Generally, I have seen customers have a mix of managed and external tables, but the percentage of external tables is high, so only a small subset of assets and related notebooks, jobs, etc. have high complexity.

Also, it is worth checking in with your Databricks account executive on options to help speed up the migration, such as delivery solution architects, professional services, and partner services.

satya1206 · ‎10-10-2024

Hi @Dhaval_Bagadia,

Thanks for the detailed plan. Could you please share a similar plan for Azure as well, if you have one?
