Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.

Migration of Unity Catalog files with Databricks migration tool

weinino
New Contributor II

Hi there,

I'm currently migrating Databricks (metastores, workspaces, etc.) from Azure to AWS using the Databricks migration tool. During the migration, and after digging into the code, I've come to the conclusion that the tool only supports migrating the built-in Hive metastore, not custom metastores in Unity Catalog.

Q1: Is this correct, or have I overlooked something?

If yes, I thought about extending the code, i.e. adapting the export / import functionality for the Hive metastore in the HiveClient to support other custom metastores from Unity Catalog as well. Specifically, this would mean adaptations in the MetastoreExportTask and MetastoreTableACLExportTask.

Q2: Is there anything else to consider?

I've also realised that the migration tool only exports / imports database and table definitions, not the data itself. This is also stated on the migration tool's documentation page:

"Note on DBFS Data Migration:

DBFS is a protected object storage location on AWS and Azure. Please contact your Databricks support team for information about migrating DBFS resources."

Q3: What is the preferred way to migrate the data in DBFS from Azure to AWS? Is it possible to just move all files / folders under the old DBFS root to the new DBFS root?

Thanks a lot in advance! 🙂

6 REPLIES

Vinay123
New Contributor III

I am also looking to replicate Unity Catalog to another region on AWS, so that I can attach it to a different workspace in the same region.

karthik_p
Esteemed Contributor

@Nino Weingart yes, you are right: the current scripts do not support Unity Catalog migration. The Databricks team may be able to help if their field engineering team is working on it.

Coming to data migration:

  1. What is your current DBFS size?
  2. Is there any unwanted data?
  3. Remove unused data.
  4. If the data is small, the Databricks team can help.
  5. If the data is large, consider converting managed tables into external tables. The data will then reside on external storage and you can use cloud-specific tools to migrate it.

Vinay123
New Contributor III

https://community.databricks.com/s/question/0D58Y0000ACZL5vSQH/unity-catlog-replication-or-disaster-...

Can you please suggest any solution or your thoughts on this @karthik p​ 

weinino
New Contributor II

@karthik p​ Ok, thanks for the confirmation.

Data migration:

Disclaimer: The migration we are doing is more of a "POC" to get an understanding of the process and its limitations for our clients.

  1. It is not a real workspace and therefore the DBFS is very small. It should however resemble a case with several GB of data.
  2. Same as 1., but for simplicity I would say we need all the data.
  3. Same as 2., all data should be migrated
  4. Ok, I see. Could you confirm that moving all files / folders under the old DBFS root to the new one, with any suitable data transfer approach between cloud vendors, should work without any additional steps?
  5. If I understood you correctly, this would mean:
    1. Pre-migration: Create an external table for each managed table and copy over the data
    2. Optional: Update all references in the old workspace to work with the external data if the old scripts still need to work
    3. On migration: Copy over the data to the new cloud vendor's data storage
    4. Optional: Create managed table in new workspace for each migrated external table, copy the data and change back the references to use the internal tables
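
The pre-migration step in this list can be sketched as a small helper that builds one `CREATE TABLE ... AS SELECT` statement per managed table. This is only an illustration under assumptions: the tables are Delta tables, the `_ext` suffix and the `<root>/<database>/<table>` folder layout are hypothetical naming choices, and the function name `external_copy_statements` is not part of the migration tool. The statements would still need to be run in a notebook or SQL editor and checked against your own workspace.

```python
def external_copy_statements(tables, external_root):
    """Build one CTAS statement per managed table.

    tables: iterable of (database, table) name pairs (hypothetical inputs).
    external_root: storage URI prefix for the external copies (assumption).
    """
    root = external_root.rstrip("/")
    statements = []
    for database, table in tables:
        # Hypothetical layout: <root>/<database>/<table>
        location = f"{root}/{database}/{table}"
        statements.append(
            f"CREATE TABLE `{database}`.`{table}_ext` "
            f"USING DELTA LOCATION '{location}' "
            f"AS SELECT * FROM `{database}`.`{table}`"
        )
    return statements


if __name__ == "__main__":
    for stmt in external_copy_statements([("sales", "orders")], "s3://bucket/ext/"):
        print(stmt)
```

Generating the statements as plain strings keeps the sketch independent of any cluster, so the list can be reviewed before anything is executed.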

karthik_p
Esteemed Contributor

@Nino Weingart once you convert to external tables, the data will be on external storage. You can either mount that storage or copy the data to new storage in the target cloud, and if you want to go with managed tables again you need to convert them back. Metadata migration will be taken care of by the scripts, and the Databricks team might help with DBFS depending on size. Copying each managed table's data to an external table makes sense (that seems to be a good idea to reduce your downtime). After you migrate your metadata to the target, the table skeletons will be present and you need to map your data as either external or managed. Smaller data should work with cp, but if the data is large, moving it with cp takes many days.
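
Mapping the migrated table skeletons to the copied data mostly comes down to rewriting each table's storage location from the old cloud's prefix to the new one. A minimal sketch of that rewrite, assuming the folder structure below the prefix is preserved during the copy; the function name `remap_location` and the example URIs are hypothetical:

```python
def remap_location(location, old_prefix, new_prefix):
    """Swap the storage prefix of a table location, keeping the relative path.

    Raises ValueError if the location does not live under old_prefix, so
    tables stored elsewhere are not silently pointed at the wrong place.
    """
    old = old_prefix.rstrip("/")
    new = new_prefix.rstrip("/")
    if location == old or location.startswith(old + "/"):
        return new + location[len(old):]
    raise ValueError(f"{location!r} is not under {old_prefix!r}")


if __name__ == "__main__":
    # Example URIs are placeholders, not real storage accounts or buckets.
    print(remap_location(
        "abfss://data@acct.dfs.core.windows.net/ext/sales/orders",
        "abfss://data@acct.dfs.core.windows.net/ext",
        "s3://my-bucket/ext",
    ))
```

The new location can then be used when recreating or repointing the external table in the target workspace.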

Anonymous
Not applicable

@Nino Weingart​ :

Q1: You are correct that the Databricks migration tool only supports migration of the built-in Hive metastore and not custom metastores in Unity Catalog. If you want to migrate custom metastores, you would need to extend the code and adapt the export/import functionality in the HiveClient.

Q2: When extending the code, you should also consider the potential impact on the performance and stability of the migration tool. Make sure to thoroughly test your changes before using them in a production environment.

Q3: Unfortunately, there is no direct way to migrate data in DBFS from Azure to AWS using the Databricks migration tool. As stated in the documentation, DBFS is a protected object storage location on both AWS and Azure, so you will need to contact your Databricks support team for information on how to migrate DBFS resources.

One possible solution for migrating the data in DBFS is to use a third-party tool such as AzCopy or AWS DataSync to copy the files/folders from the old DBFS root to the new DBFS root. However, be aware that there may be differences in the format of the files/folders between the two cloud providers, so you may need to make some adjustments or transformations to the data during the migration process. Additionally, it's important to make sure that the migration process does not disrupt any ongoing workloads or data pipelines that rely on the DBFS data.
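
Whichever copy tool is used, it may help to check that nothing was lost on the way. The sketch below compares two file listings given as plain dictionaries of relative path to size in bytes. How those listings are produced (for example from each cloud's storage listing) is left out and depends on your setup, and the function name `compare_manifests` is hypothetical.

```python
def compare_manifests(source, target):
    """Compare two {relative_path: size_in_bytes} listings.

    Returns the paths missing from the target, paths whose sizes differ,
    and paths that exist only in the target.
    """
    missing = sorted(p for p in source if p not in target)
    size_mismatch = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    extra = sorted(p for p in target if p not in source)
    return {"missing": missing, "size_mismatch": size_mismatch, "extra": extra}


if __name__ == "__main__":
    src = {"tables/a/part-0.parquet": 100, "tables/a/part-1.parquet": 250}
    dst = {"tables/a/part-0.parquet": 100}
    print(compare_manifests(src, dst))
```

A size comparison only catches missing or truncated files. It does not prove the contents are identical, so a checksum comparison would be a stricter follow-up.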
