Databricks offers several methods to migrate Hive tables to Unity Catalog, each with its own pros and cons. The right choice depends on your specific requirements.
- SYNC: A SQL command that upgrades Hive schemas or tables to Unity Catalog external tables. However, tracking the migration status across many tables (e.g., thousands) can be challenging.
- CLONE: A SQL command that performs a deep clone, migrating Hive-managed tables to Unity Catalog-managed tables. This method allows for individual table execution, making it a suggested approach.
- UCX: A command-line tool currently in Databricks Labs, not fully approved for production use. While it may work for some applications, it's not recommended for production migration; my own thought is that you could still use it to assist the migration.
- Unity Catalog Upgrade Wizard: A UI-based tool for quickly upgrading Hive tables to Unity Catalog external tables. However, it's not suitable for large-scale production migrations.
- Build your own Databricks workflow / API using SYNC and CLONE: In my experience, every migration eventually ends up with this method (see the sketch after this list).
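To make that last option concrete, here is a minimal sketch of such a workflow, assuming a Databricks notebook or job where `spark` is available; the catalog, schema, and table names are placeholders, and a real migration would add error handling and a persistent audit table.

```python
# Minimal "build your own workflow" sketch: drive SYNC (external tables) and
# DEEP CLONE (managed tables) from Python so the migration of many tables can
# be tracked in one place. All names below are illustrative placeholders.

target_catalog = "main"  # hypothetical Unity Catalog catalog

tables_to_migrate = [
    # (hive schema, table, "external" or "managed")
    ("sales", "orders", "external"),
    ("sales", "customers", "managed"),
]

for schema, table, table_type in tables_to_migrate:
    source = f"hive_metastore.{schema}.{table}"
    target = f"{target_catalog}.{schema}.{table}"

    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{schema}")

    if table_type == "external":
        # SYNC upgrades an external Hive table to a UC external table in place
        # and returns a status report.
        report = spark.sql(f"SYNC TABLE {target} FROM {source}")
    else:
        # DEEP CLONE copies a Hive-managed table's data into a UC-managed table.
        report = spark.sql(f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}")

    # Persist or display each result so thousands of tables can be tracked,
    # e.g., by appending the report to a Delta audit table.
    report.show(truncate=False)
```

Wrapping the commands in a loop like this (or one Workflows task per schema) is what makes a migration of thousands of tables trackable, which is exactly where running plain SYNC by hand falls short.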
What am I thinking?
Migrating to Unity Catalog is not just a simple recreation of table objects from Hive to Unity Catalog. Instead, treat it as an opportunity to build a more robust and efficient data lake by leveraging Databricks' advanced features, such as liquid clustering, SQL Warehouse, Delta Live Tables, governance, security, Delta Sharing, and Databricks Asset Bundles.
When you initially built your application on Hive tables, Databricks may not have been as advanced, and you likely built your data pipelines with Notebooks, ADF, and complex Scala JARs. You may also not have been following software engineering best practices, such as test coverage, continuous integration, and continuous deployment (CI/CD).
It's a known fact: 2-3 years ago, you may not have been an SME in Databricks, but now you have gained a deeper understanding of the tool and its features, along with hands-on experience with your application, platform, and cloud services, which has changed your perspective on and approach to using Databricks.
Migrating Hive tables to UC while keeping the same legacy data pipelines is the "same salted recipe on a different plate": if the recipe is the same, the taste is the same too.
Kick out the notebook deployments and Scala code (sorry, Scala devs!) and instead write your code in DLT, Python, or dbt (hello, SQL devs!). Use Databricks Asset Bundles to consolidate your entire application infrastructure in one place. Establish uniform code patterns and sustainable practices, leveraging DLT's data quality capabilities to design robust data pipelines (see the sketch below).
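As an illustration of that shift, here is a minimal DLT sketch with a data quality expectation, assuming the pipeline is attached to a UC catalog and schema; the table, column, and landing-path names are hypothetical.

```python
# A small Delta Live Tables pipeline: bronze ingestion plus a silver table
# guarded by an expectation. All names and the landing path are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/landing/orders")  # hypothetical landing path
    )

@dlt.table(comment="Silver orders with a basic data quality check")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("ingested_at", F.current_timestamp())
    )
```

Combined with a Databricks Asset Bundle that declares the pipeline, its target catalog and schema, and the jobs around it, the whole application lives in version control and can be deployed through CI/CD.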
Conclusion note:
Take the UC migration as an opportunity to refactor your data lake application.