Data Engineering

Databricks Deep Clone

SenthilJ
New Contributor III

Hi,

I am working on a DR design for Databricks on Azure. The recommendation from Databricks is to use Deep Clone to clone the Unity Catalog tables (within or across catalogs). My design must ensure that DR is managed across two regions, i.e. primary and secondary. The active (live) Databricks setup will be hosted in the primary region with its own metastore, and a similar setup will be created in the secondary region for the passive instance.

In this case, does Databricks Deep Clone offer cloning of UC objects across two different metastores, one hosted in each region? If not, is there an alternative that meets this DR objective?

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager

Hi @SenthilJ, The recommendation from Databricks to use Deep Clone for cloning Unity Catalog (UC) tables is indeed a prudent approach. Deep Clone facilitates the seamless replication of UC objects, including schemas, managed tables, access permissions, tags, and comments.

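For context, within a single metastore a Deep Clone is a one-line SQL statement per table. A minimal sketch (the three-level names `main.sales.orders` and `dr_catalog.sales.orders` are hypothetical) that builds such a statement, which you would execute with `spark.sql(...)` in a Databricks notebook:

```python
def deep_clone_sql(source_table: str, target_table: str) -> str:
    """Build the SQL for a Delta Deep Clone (copies both data and metadata)."""
    return (
        f"CREATE OR REPLACE TABLE {target_table} "
        f"DEEP CLONE {source_table}"
    )

# Hypothetical table names; in a notebook you would run:
#   spark.sql(deep_clone_sql("main.sales.orders", "dr_catalog.sales.orders"))
print(deep_clone_sql("main.sales.orders", "dr_catalog.sales.orders"))
```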
    • However, let’s address your specific scenario: an active (live) Databricks setup in the primary region with its own metastore, and a similar setup in the secondary region for the passive instance.
    • Deep Clone operates within a single metastore. It does not inherently support cloning UC objects across two different metastores hosted in separate regions.
    • In other words, with distinct metastores (one per region), you cannot directly use Deep Clone to synchronize UC objects between them.
    • To achieve your DR objective, consider an alternative approach:
        • Automated cloning script: create a custom script that handles the migration of UC objects across metastores. The script should:
            • create a new catalog in the secondary region with the desired storage location;
            • incrementally clone UC objects (schemas, tables, permissions, etc.) from the primary region’s catalog to the secondary region’s catalog;
            • ensure consistency and integrity during the process.
        • Scheduled execution: schedule the script to run periodically, or as needed, to keep the secondary region’s catalog up to date.
        • Testing and validation: thoroughly test the script to validate its correctness and reliability.
    • Data movement: beyond UC metadata, consider how the underlying data (table files, etc.) will be replicated between the primary and secondary regions.
    • Network latency: account for network latency and bandwidth constraints when synchronizing data across regions.
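The script outline above could be sketched roughly as follows. This is a hedged sketch, not a complete implementation: it assumes the secondary workspace can address the primary's tables (for example via Delta Sharing or shared cloud storage), and the catalog, schema, and storage-location names are hypothetical. The function only generates the SQL statements; a scheduled Databricks job would execute each one with `spark.sql`.

```python
from typing import Iterable, List


def plan_catalog_sync(
    source_catalog: str,
    target_catalog: str,
    tables: Iterable[str],   # "schema.table" names discovered in the primary catalog
    managed_location: str,   # cloud storage path for the DR catalog (hypothetical)
) -> List[str]:
    """Produce the SQL statements to (re)build a DR catalog via Deep Clone.

    Deep Clone is incremental: re-running the same CLONE statement copies
    only the files changed since the previous run, which suits a scheduled
    sync job.
    """
    stmts = [
        f"CREATE CATALOG IF NOT EXISTS {target_catalog} "
        f"MANAGED LOCATION '{managed_location}'"
    ]
    seen_schemas = set()
    for fqn in tables:
        schema, table = fqn.split(".")
        if schema not in seen_schemas:
            stmts.append(f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{schema}")
            seen_schemas.add(schema)
        stmts.append(
            f"CREATE OR REPLACE TABLE {target_catalog}.{schema}.{table} "
            f"DEEP CLONE {source_catalog}.{schema}.{table}"
        )
    return stmts


# Hypothetical names; in a scheduled Databricks job you would loop:
#   for stmt in plan_catalog_sync(...): spark.sql(stmt)
for stmt in plan_catalog_sync(
    "main",
    "dr_main",
    ["sales.orders", "sales.customers"],
    "abfss://dr@account.dfs.core.windows.net/uc",
):
    print(stmt)
```

Note that this only covers tables; permissions, tags, and comments would need separate handling (e.g. replaying `GRANT` statements), and validation of row counts or table versions after each run is advisable.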

