Authors: Ehsan Olfat (@esiol) and Vasco Lopes (@vmgl)
Databricks Unity Catalog (UC) is the first unified governance solution for data and AI in the Lakehouse. It enables secure management of diverse data and AI assets on any cloud or platform, fostering collaboration and productivity while providing features that support regulatory compliance.
If you already have your assets in a UC catalog, there are reasons that might require reorganizing the storage locations of your catalogs and schemas. The flexibility to change to new managed storage locations offers several benefits, including separate billing, enhanced security, regulatory compliance, and better data organization.
To leverage these benefits, existing users might aim to migrate by cloning their UC catalogs, and the data assets within them, to new catalogs. Another scenario occurs when customers have been benefiting from these advantages but need to transition to new storage locations due to organizational changes, for instance, changes in Azure subscriptions or AWS accounts. Technological reasons, such as rate limits, can also play a role: some storage systems implement rate limiting, so when storage is shared across the entire organization, concurrent reading of data might be adversely affected. In all these scenarios, customers need to clone their previous catalogs to new catalogs and create new external locations.
In this blog post, we guide you through using a cloning script that creates a new catalog with an updated storage location and seamlessly clones all associated schemas, UC managed tables, access permissions, tags, and comments. This task can be quite challenging, especially when a catalog contains numerous assets. The script is tailored to clone the aforementioned data assets to a new catalog with its designated location, including the possibility of changing the locations of the catalog's schemas as well. This effectively eliminates the need for manual cloning, offering an efficient solution for your catalog location cloning requirements.
It's important to note that this cloning script specifically targets UC Managed tables using the Delta format. Keep in mind that catalogs can contain various other asset types, such as External Tables, Views, Materialized Views, Streaming Tables, External Volumes, Managed Volumes, Functions, and Models. The cloning of these assets falls outside the scope of this blog post and is considered a subject for future work.
Note: This guide/script is about UC catalog to UC catalog cloning, meaning you are already a user of Databricks UC. If you have not migrated to UC yet, you need to upgrade from the Hive Metastore to UC, along with your account, groups, workspaces, jobs, etc. Please see UCX and read the blog How to upgrade your Hive tables to Unity Catalog.
This guide offers a walkthrough of the cloning process, using a Python script that leverages the Databricks SDK to perform REST API operations. In addition, Spark SQL statements are used for some operations.
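To make this concrete, here is a minimal sketch of both mechanisms; the catalog, schema, and table names are placeholders (not values from the actual script), and spark is the SparkSession that is available by default in a Databricks notebook.

from databricks.sdk import WorkspaceClient

# The SDK wraps the Databricks REST APIs; in a Databricks notebook it
# authenticates automatically from the runtime context.
w = WorkspaceClient()

# REST API operation via the SDK: list schemas in a (placeholder) catalog
for schema in w.schemas.list(catalog_name="my_source_catalog"):
    print(schema.full_name)

# Spark SQL operation: inspect a table's metadata (placeholder names)
spark.sql("DESCRIBE TABLE EXTENDED my_source_catalog.my_schema.my_table").show()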
A successful cloning operation begins with setting up the environment, ensuring all requirements are in place, to avoid unexpected errors while running the cloning script.
You need to follow the steps below before you run the script that copies data from the source catalog to the new target catalog.
Prior to starting the cloning process, you need to create the storage locations in your cloud, such as AWS S3 buckets or Azure ADLS storage account containers. These locations will house the managed data in your target catalog.
You also need to create the Storage Credentials. A Storage Credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, and it must exist before you can create External Locations.
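For illustration, once a Storage Credential exists, an External Location can be created with the Databricks SDK; the names and URL below are placeholders, not values from the script.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Register an External Location backed by an existing Storage Credential.
# The URL must point to the cloud storage location you created above.
w.external_locations.create(
    name="target_catalog_location",           # placeholder name
    url="abfss://container@account.dfs.core.windows.net/target-catalog",
    credential_name="my_storage_credential",  # existing Storage Credential
)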
In this section, we walk you through the steps to deploy the cloning script. The source code, along with an example notebook, is available in this Git repository.
The cloning script can be run in a Databricks notebook, locally in VS Code using the Databricks extension for VS Code, or by leveraging Databricks Connect. Clone the Git repository and verify that a module called clonecatalog.py exists. This module contains a Python class called CloneCatalog that automates the cloning process.
Install the Databricks SDK for Python, which offers functionality to accelerate development with Python for the Databricks Lakehouse. Run the %pip magic command from a notebook cell as follows.
%pip install databricks-sdk --upgrade
If you are in Databricks, restart Python after the %pip magic command by running the following command in the notebook cell immediately after it.
dbutils.library.restartPython()
You need to import CloneCatalog from clonecatalog.py.
from clonecatalog import CloneCatalog
Declare the input arguments as follows:
inputs = dict(
    source_catalog_external_location_name="your source external location name",
    source_catalog_name="your source catalog name",
    target_catalog_external_location_pre_req=[
        "your target external location name",
        "your Storage Credential name",
        "your target cloud location url",  # ADLS, S3, or GS
    ],
    target_catalog_name="your target catalog name",
)
Like catalogs, schemas can have their own managed storage locations. If your schemas need dedicated storage, you can specify the cloud storage locations for the new schemas in the new catalog.
An optional parameter, schemas_locations_dict, serves this purpose. Add a schema name as a dictionary key and its prerequisites, as a list, as the corresponding value. If you don't need to specify a location for certain schemas, or don't wish to change their location, simply leave them out of this dictionary.
The list of prerequisites has three items, as follows:
schemas_locations_dict = {
    "schema1 to change location": [
        "your target external location name for schema1",
        "your Storage Credential name for schema1",
        "your target cloud location url for schema1",  # ADLS, S3, or GS
    ],
    "schema2 to change location": [
        "your target external location name for schema2",
        "your Storage Credential name for schema2",
        "your target cloud location url for schema2",  # ADLS, S3, or GS
    ],
    ...
}
Create an instance of the CloneCatalog class with the input parameters defined above, then call it. As stated before, changing schema locations is optional, so you can omit the schemas_locations_dict argument.
clone = CloneCatalog(**inputs, schemas_locations_dict=schemas_locations_dict)
clone()
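If none of your schemas need their own location, simply omit the optional argument:

clone = CloneCatalog(**inputs)  # schemas then use the target catalog's managed location
clone()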
If you stop and re-run the cloning process, previously created assets will not be re-created; the script continues cloning the assets that were not cloned in the previous run.
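As a simplified sketch of this idempotent behavior (not the script's actual code), the existence check for tables could look like the following; all catalog and schema names are placeholders.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Collect the tables already present in the target schema
existing = {t.name for t in w.tables.list(catalog_name="target_catalog",
                                          schema_name="my_schema")}
for table in w.tables.list(catalog_name="source_catalog", schema_name="my_schema"):
    if table.name in existing:
        continue  # already cloned in a previous run
    # ... clone the table here ...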
Here is a flowchart illustrating the high-level implementation of the cloning process.
Note: The heavy lifting of data cloning takes place when creating the new tables. The script uses Delta Deep Clone to replicate the managed tables in the new schemas. Some tables might be quite large, or a schema might contain many tables; as a result, the execution time of this step can be lengthy.
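For reference, a Deep Clone issued through Spark SQL looks roughly like the following; the fully qualified table names are placeholders. Deep Clone copies both the table's metadata and its data, and IF NOT EXISTS keeps re-runs from failing on tables that were already cloned.

spark.sql("""
    CREATE TABLE IF NOT EXISTS target_catalog.my_schema.my_table
    DEEP CLONE source_catalog.my_schema.my_table
""")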
As the cloning process progresses, it outputs descriptive messages to the console.
Customizing managed storage locations for catalogs and schemas in UC offers valuable business advantages, including separate billing, enhanced security, regulatory compliance, and data organization. This blog post introduces a cloning script to efficiently create new catalogs and schemas with updated storage locations and seamlessly clone associated data assets, saving time and resources.
The code shared in this blog post is provided under the DB License, granting you the freedom to assess, modify, and adapt it under the terms of the license to meet your specific requirements. Be aware that neither the authors of this blog post nor Databricks assume any responsibility for the code's use, nor do they provide official support for its implementation. Given the script's potential limitations, the code is not recommended for direct use in a production environment without thorough testing.