Best Practices for Copying Data Between Environments

KrzysztofPrzyso — Thu, 05 Sep 2024 16:14:18 GMT

Hi Everyone,

I'd like to start a discussion about the best practices for copying data between environments. Here's a typical setup:

Environment Setup:

The same region and metastore (Unity Catalog) is used across environments.
Each environment has a single catalog.
Data needs to be copied from PROD to UAT or SIT to enable testing in a non-empty environment.
Role assignments are managed at the catalog/schema level, so there's no need to worry about access controls.
Only one catalog is visible per workspace, based on workspace/catalog assignments (in most cases).
All tables are owned by a dedicated service principal in every environment.
Only the current version of the data is needed; data history is not required.
All tables are external Delta tables (still not convinced about managed tables, sorry Databricks!).

Requirements and Considerations:

Copy or clone tables from a higher environment (e.g., PROD or Staging) to a lower environment (e.g., UAT, SIT, DEV).
Minimize effort and cost.
A shallow clone might be sufficient, as it creates a new table definition that points to the source table files, effectively achieving a "zero-copy."

Key Questions:

How can we implement this across different environments on the platform?
How should ownership and access rights be designed to ensure proper isolation?
Users prefer to see only the catalog assigned to a given workspace (showing more could cause confusion).
Service principals (SPNs) owning catalogs/schemas should not have access to multiple catalogs to maintain isolation.

Proposed Approach:

Create a dedicated "cloning workspace" with read-only access to all catalogs, including PROD.
Restrict access to this workspace to high-level admins and specific service principals only.
Establish a "cloner SPN" that has read access to PROD and modify/create table permissions in all other environments.
Run an automated process in the "cloning workspace" using the "cloner SPN" to clone/create/replace tables in the target catalog while pointing to the source files.
Change the table ownership from the "cloner SPN" to the designated "catalog owner SPN."

Open Questions:

Has anyone tried this approach, and will it work as expected?
The main concern is around access rights to tables and underlying data files when using a shallow clone. It might be easier to use CTAS (CREATE TABLE AS SELECT) instead.

Please share your thoughts and experiences.

Thank you!

Re: Best Practices for Copying Data Between Environments

Sidhant07 — Fri, 06 Dec 2024 06:51:11 GMT

Using CTAS (CREATE TABLE AS SELECT) might be a more robust solution for your use case:

Independence: CTAS creates a new, independent copy of the data, avoiding dependencies on the source table
Simplified access control: Access rights can be managed solely within the target environment.
Flexibility: You can easily modify the table structure or apply transformations during the copy process.

Use Delta Lake's CLONE command for efficient copying when appropriate.
Implement incremental updates to minimize data transfer and processing time for subsequent refreshes.
Consider using Databricks Workflows to automate and schedule the cloning process
7
.

The proposed approach is viable, but consider using CTAS instead of shallow clones for better isolation and simpler access management.
Access rights to underlying data files are indeed a concern with shallow clones. CTAS avoids this issue by creating independent copies.