04-17-2023 08:17 AM
Hi there,
I'm currently migrating Databricks (metastores, workspaces, etc.) from Azure to AWS using the Databricks migration tool. During the migration and after digging into the code, I've come to the conclusion that the tool only supports migrating the built-in Hive metastore, but not custom metastores in Unity Catalog.
Q1: Is this correct, or have I overlooked something?
If yes, I thought about extending the code, i.e. adapting the export / import functionality for the Hive metastore in the HiveClient to support other custom metastores from Unity Catalog as well. Specifically, this would mean adaptations in the MetastoreExportTask and MetastoreTableACLExportTask.
Q2: Is there anything else to consider?
I've also realised that the migration tool only exports / imports database and table definitions, but not the data itself. This is also stated on the migration tool's documentation page:
"Note on DBFS Data Migration:
DBFS is a protected object storage location on AWS and Azure. Please contact your Databricks support team for information about migrating DBFS resources."
Q3: What is the preferred way to migrate the data in DBFS from Azure to AWS? Is it possible to just move all files / folders under the old DBFS root to the new DBFS root?
Thanks a lot in advance! 🙂
04-17-2023 01:13 PM
I am also looking to replicate a Unity Catalog metastore to another region on AWS, so that I can attach it to a different workspace in the same region.
04-17-2023 01:23 PM
@Nino Weingart Yes, you are right: the current scripts do not support Unity Catalog migration. The Databricks team may be able to help if their field engineering team is working on it.
Coming to data migration:
1. What is your current DBFS size?
2. Is there any unwanted data?
3. Remove unused data.
4. If the remaining data is small, the Databricks team can help.
5. If it is large, consider converting managed tables into external tables. The data will then reside on external storage, and you can use cloud-specific tools to migrate it.
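To make the last point concrete, here is a minimal sketch of how one could generate the Spark SQL for copying a managed table into an external location with a CTAS statement. The `_ext` suffix, the storage path, and the function name are assumptions for illustration only, and this is just one possible way to do the conversion, not something the migration tool does for you.

```python
def external_copy_sql(table: str, external_root: str) -> str:
    """Build a Spark SQL CTAS statement that copies a managed table
    to an external location.

    `table` is "database.table"; `external_root` is a hypothetical
    storage prefix you control. The `_ext` suffix is an arbitrary choice.
    """
    db, name = table.split(".", 1)
    location = f"{external_root.rstrip('/')}/{db}/{name}"
    return (
        f"CREATE TABLE {db}.{name}_ext USING DELTA "
        f"LOCATION '{location}' AS SELECT * FROM {db}.{name}"
    )


if __name__ == "__main__":
    # In a notebook, each statement could be passed to spark.sql(...).
    for t in ["sales.orders", "sales.customers"]:
        print(external_copy_sql(t, "abfss://data@myaccount.dfs.core.windows.net/ext"))
```

Once the data sits under an external location, it can be copied with whatever storage-level tooling you use between the two clouds.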
04-17-2023 10:33 PM
Can you please suggest a solution or share your thoughts on this, @karthik p?
04-18-2023 12:18 AM
@karthik p Ok, thanks for the confirmation.
Data migration:
Disclaimer: The migration we are doing is rather a "POC" to get an understanding of the process and its limitations for our clients.
04-18-2023 06:15 AM
@Nino Weingart Once you convert a table to an external table, its data will be on external storage. You can either mount that storage or copy the data to new storage in the target cloud, and if you prefer managed tables, you need to convert them back afterwards. Metadata migration will be taken care of by the scripts, and the Databricks team might help with DBFS depending on its size. Copying each managed table's data out as an external table makes sense (that seems to be a good idea to lessen your downtime). After you migrate your metadata to the target, the table skeletons will be present and you need to map your data as either external or managed. Smaller data should work with cp, but if the data is large, moving it with cp takes many days.
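To get a feel for when a plain copy stops being practical, a rough back-of-the-envelope estimate helps. The sketch below assumes a single sustained throughput figure and ignores per-file overhead, throttling, and retries, so treat the result as a lower bound. The function name and the example numbers are made up for illustration.

```python
def estimated_transfer_days(size_tb: float, throughput_mbps: float) -> float:
    """Days needed to move `size_tb` terabytes (decimal, 10^12 bytes)
    at a sustained throughput of `throughput_mbps` megabits per second."""
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (throughput_mbps * 1e6)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Example: 50 TB at a sustained 500 Mbit/s
    print(f"{estimated_transfer_days(50, 500):.1f} days")
```

If the estimate comes out at more than a few days, that is usually the point where a managed transfer service or help from the Databricks team is worth considering.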
04-18-2023 01:55 AM
@Nino Weingart :
Q1: You are correct that the Databricks migration tool only supports migration of the built-in Hive metastore and not custom metastores in Unity Catalog. If you want to migrate custom metastores, you would need to extend the code and adapt the export/import functionality in the HiveClient.
Q2: When extending the code, you should also consider the potential impact on the performance and stability of the migration tool. Make sure to thoroughly test your changes before using them in a production environment.
Q3: Unfortunately, there is no direct way to migrate data in DBFS from Azure to AWS using the Databricks migration tool. As stated in the documentation, DBFS is a protected object storage location on both AWS and Azure, so you will need to contact your Databricks support team for information on how to migrate DBFS resources.
One possible solution for migrating the data in DBFS is to use a third-party tool such as AzCopy or AWS DataSync to copy the files/folders from the old DBFS root to the new DBFS root. However, be aware that there may be differences in the format of the files/folders between the two cloud providers, so you may need to make some adjustments or transformations to the data during the migration process. Additionally, it's important to make sure that the migration process does not disrupt any ongoing workloads or data pipelines that rely on the DBFS data.
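Regarding Q1, a first step towards a Unity Catalog export could be to enumerate everything that would need exporting. The sketch below walks catalogs, schemas, and tables through the Unity Catalog REST endpoints under `/api/2.1/unity-catalog`. It is only an illustration: it does not plug into the migration tool's HiveClient or export tasks, it ignores pagination, and the helper names (`http_get`, `list_uc_tables`) are hypothetical. Please check the endpoint parameters and response fields against the current REST API docs before relying on it.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

Fetch = Callable[[str, Dict[str, str]], dict]


def http_get(host: str, token: str) -> Fetch:
    """Return a fetch function that calls the workspace REST API.
    `host` is the workspace URL, `token` a personal access token."""
    def fetch(path: str, params: Dict[str, str]) -> dict:
        url = f"{host}{path}"
        if params:
            url += "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch


def list_uc_tables(fetch: Fetch) -> Iterator[str]:
    """Yield full table names (catalog.schema.table) for every table
    visible to the caller. Pagination is not handled in this sketch."""
    base = "/api/2.1/unity-catalog"
    for cat in fetch(f"{base}/catalogs", {}).get("catalogs", []):
        schemas = fetch(f"{base}/schemas", {"catalog_name": cat["name"]})
        for sch in schemas.get("schemas", []):
            tables = fetch(
                f"{base}/tables",
                {"catalog_name": cat["name"], "schema_name": sch["name"]},
            )
            for tbl in tables.get("tables", []):
                yield tbl["full_name"]
```

Usage would look like `for name in list_uc_tables(http_get("https://<workspace-url>", "<token>")): print(name)`. Passing the fetch function in keeps the traversal testable without a live workspace.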