Authors: Ehsan Olfat (@esiol) and Vasco Lopes (@vmgl)
Databricks Unity Catalog (UC) is the first unified governance solution for data and AI in the Lakehouse. It enables secure management of diverse data and AI assets on any cloud or platform, fostering collaboration and productivity while providing features that support regulatory compliance.
If you already have your assets in a UC catalog, there are reasons that might require reorganizing the storage locations of your catalogs and schemas. The flexibility to change to new managed storage locations offers several benefits, including separate billing, enhanced security, regulatory compliance, and better data organization.
To leverage these benefits, existing users might aim to migrate by cloning their UC catalogs, and the data assets within them, to new catalogs. Another scenario occurs when customers have been benefiting from these advantages but need to transition to new storage locations due to organizational changes, for instance, changes in Azure subscriptions or AWS accounts. Technological reasons, such as rate limits, can also play a role: some storage systems implement rate limiting, so when storage is shared across the entire organization, concurrent reading of data might be adversely affected. In all these scenarios, customers need to clone their previous catalogs to new catalogs and create new external locations.
In this blog post, we guide you through using a cloning script that creates a new catalog with an updated storage location and seamlessly clones all associated schemas, UC managed tables, access permissions, tags, and comments. This task can be quite challenging, especially when a catalog contains numerous assets. The script is tailored to clone the aforementioned data assets to a new catalog with its designated location, including the possibility of changing the locations of the catalog's schemas as well. This effectively eliminates the need for manual cloning, offering an efficient solution for your catalog location cloning requirements.
It's important to note that this cloning script specifically targets UC Managed tables using the Delta format. Keep in mind that catalogs can contain various other asset types, such as External Tables, Views, Materialized Views, Streaming Tables, External Volumes, Managed Volumes, Functions, and Models. The cloning of these assets falls outside the scope of this blog post and is considered a subject for future work.
Note: This guide/script is about UC catalog to UC catalog cloning, meaning you are already a user of Databricks UC. If you have not migrated to UC yet, you need to upgrade from the Hive Metastore to UC, along with your account, groups, workspaces, jobs, etc. Please see UCX and read the blog How to upgrade your Hive tables to Unity Catalog.
This guide offers a walkthrough of the cloning process, using a Python script that leverages the Databricks SDK to perform REST API operations. In addition, Spark SQL statements are used for some operations.
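To make this concrete, here is a minimal sketch of both mechanisms; the catalog, schema, and table names are placeholders (not values from the actual script), and spark is the SparkSession that is available by default in a Databricks notebook.

from databricks.sdk import WorkspaceClient

# The SDK wraps the Databricks REST APIs; in a Databricks notebook it
# authenticates automatically from the runtime context.
w = WorkspaceClient()

# REST API operation via the SDK: list schemas in a (placeholder) catalog
for schema in w.schemas.list(catalog_name="my_source_catalog"):
    print(schema.full_name)

# Spark SQL operation: inspect a table's metadata (placeholder names)
spark.sql("DESCRIBE TABLE EXTENDED my_source_catalog.my_schema.my_table").show()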
A successful cloning operation begins with setting up the environment, ensuring all requirements are in place, to avoid unexpected errors while running the cloning script.
You need to follow the steps below before you run the script that copies data from the source catalog to the new target catalog.
Prior to starting the cloning process, you need to create the storage locations in your cloud, such as AWS S3 buckets or Azure ADLS storage account containers. These locations will house the managed data in your target catalog.
You also need to create the Storage Credentials. A Storage Credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, and it must exist before you can create External Locations.
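For illustration, once a Storage Credential exists, an External Location can be created with the Databricks SDK; the names and URL below are placeholders, not values from the script.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Register an External Location backed by an existing Storage Credential.
# The URL must point to the cloud storage location you created above.
w.external_locations.create(
    name="target_catalog_location",           # placeholder name
    url="abfss://container@account.dfs.core.windows.net/target-catalog",
    credential_name="my_storage_credential",  # existing Storage Credential
)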
In this section, we walk you through the steps to deploy the cloning script. The source code, along with an example notebook, is available in this Git repository.
The cloning script can be run in a Databricks notebook, locally in VS Code using the Databricks extension for VS Code, or by leveraging Databricks Connect. Clone the Git repository and verify that a module called clonecatalog.py exists. This module contains a Python class called CloneCatalog that automates the cloning process.
Install the Databricks SDK for Python, which offers functionality to accelerate development with Python for the Databricks Lakehouse. Run the %pip magic command from a notebook cell as follows.
%pip install databricks-sdk --upgrade
If you are in Databricks, restart Python after the %pip magic command by running the following command in the notebook cell immediately after it.
dbutils.library.restartPython()
You need to import CloneCatalog from clonecatalog.py.
from clonecatalog import CloneCatalog
Declare the input arguments as follows:
inputs = dict(
    source_catalog_external_location_name="your source external location name",
    source_catalog_name="your source catalog name",
    target_catalog_external_location_pre_req=[
        "your target external location name",
        "your Storage Credential name",
        "your target cloud location url",  # ADLS, S3, or GS
    ],
    target_catalog_name="your target catalog name",
)
Like catalogs, schemas can have their own managed storage locations. If your schemas need dedicated storage, you can specify the cloud storage locations for the new schemas in the new catalog.
An optional parameter, schemas_locations_dict, serves this purpose. Add a schema name as a dictionary key and its prerequisites, as a list, as the corresponding value. If you don't need to specify a location for certain schemas, or don't wish to change their location, simply leave them out of this dictionary.
The list of prerequisites has three items, as follows:
schemas_locations_dict = {
    "schema1 to change location": [
        "your target external location name for schema1",
        "your Storage Credential name for schema1",
        "your target cloud location url for schema1",  # ADLS, S3, or GS
    ],
    "schema2 to change location": [
        "your target external location name for schema2",
        "your Storage Credential name for schema2",
        "your target cloud location url for schema2",  # ADLS, S3, or GS
    ],
    ...
}
Create an instance of the CloneCatalog class with the input parameters defined above, then call it. As stated before, changing schema locations is optional, so you can omit the schemas_locations_dict argument.
clone = CloneCatalog(**inputs, schemas_locations_dict=schemas_locations_dict)
clone()
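If none of your schemas need their own location, simply omit the optional argument:

clone = CloneCatalog(**inputs)  # schemas then use the target catalog's managed location
clone()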
If you stop and re-run the cloning process, previously created assets will not be re-created; the script continues cloning the assets that were not cloned in the previous run.
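As a simplified sketch of this idempotent behavior (not the script's actual code), the existence check for tables could look like the following; all catalog and schema names are placeholders.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Collect the tables already present in the target schema
existing = {t.name for t in w.tables.list(catalog_name="target_catalog",
                                          schema_name="my_schema")}
for table in w.tables.list(catalog_name="source_catalog", schema_name="my_schema"):
    if table.name in existing:
        continue  # already cloned in a previous run
    # ... clone the table here ...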
Here is a flowchart illustrating the high-level implementation of the cloning process.
Note: The heavy lifting of data cloning takes place when creating the new tables. The script uses Delta Deep Clone to replicate the managed tables in the new schemas. Some tables might be quite large, or a schema might contain many tables; as a result, the execution time of this step can be lengthy.
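For reference, a Deep Clone issued through Spark SQL looks roughly like the following; the fully qualified table names are placeholders. Deep Clone copies both the table's metadata and its data, and IF NOT EXISTS keeps re-runs from failing on tables that were already cloned.

spark.sql("""
    CREATE TABLE IF NOT EXISTS target_catalog.my_schema.my_table
    DEEP CLONE source_catalog.my_schema.my_table
""")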
As the cloning process progresses, it outputs descriptive messages to the console.
Customizing managed storage locations for catalogs and schemas in UC offers valuable business advantages, including separate billing, enhanced security, regulatory compliance, and data organization. This blog post introduces a cloning script to efficiently create new catalogs and schemas with updated storage locations and seamlessly clone associated data assets, saving time and resources.
The code shared in this blog post is provided under the DB License, granting you the freedom to assess, modify, and adapt it under the terms of the license to meet your specific requirements. Be aware that neither the authors of this blog post nor Databricks assume any responsibility for the code's use, nor do they provide official support for its implementation. Given the script's potential limitations, the code is not recommended for direct use in a production environment without thorough testing.