Best Practices for Copying Data Between Environments
09-05-2024 09:14 AM
Hi Everyone,
I'd like to start a discussion about the best practices for copying data between environments. Here's a typical setup:
Environment Setup:
- The same region and metastore (Unity Catalog) are used across environments.
- Each environment has a single catalog.
- Data needs to be copied from PROD to UAT or SIT to enable testing in a non-empty environment.
- Role assignments are managed at the catalog/schema level, so there's no need to worry about access controls.
- Only one catalog is visible per workspace, based on workspace/catalog assignments (in most cases).
- All tables are owned by a dedicated service principal in every environment.
- Only the current version of the data is needed; data history is not required.
- All tables are external Delta tables (still not convinced about managed tables, sorry Databricks!).
Requirements and Considerations:
- Copy or clone tables from a higher environment (e.g., PROD or Staging) to a lower environment (e.g., UAT, SIT, DEV).
- Minimize effort and cost.
- A shallow clone might be sufficient, as it creates a new table definition that points to the source table files, effectively achieving a "zero-copy."
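For illustration, a minimal shallow-clone statement under the setup above; the prod/uat catalog names and the sales.orders table are placeholders, not part of the original setup:

```sql
-- Placeholder names: prod and uat catalogs, sales.orders table.
-- A shallow clone copies only the table definition; the clone's metadata
-- points at the source table's existing Delta data files (no data copy).
CREATE OR REPLACE TABLE uat.sales.orders
  SHALLOW CLONE prod.sales.orders;
```

Whether principals reading the clone also need access to the source table's underlying files is exactly the access-rights concern raised in the open questions below.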
Key Questions:
- How can we implement this across different environments on the platform?
- How should ownership and access rights be designed to ensure proper isolation?
- Users prefer to see only the catalog assigned to a given workspace (showing more could cause confusion).
- Service principals (SPNs) owning catalogs/schemas should not have access to multiple catalogs to maintain isolation.
Proposed Approach:
- Create a dedicated "cloning workspace" with read-only access to all catalogs, including PROD.
- Restrict access to this workspace to high-level admins and specific service principals only.
- Establish a "cloner SPN" that has read access to PROD and modify/create table permissions in all other environments.
- Run an automated process in the "cloning workspace" using the "cloner SPN" to clone/create/replace tables in the target catalog while pointing to the source files.
- Change the table ownership from the "cloner SPN" to the designated "catalog owner SPN."
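As a sketch of that hand-over step, the cloner SPN could transfer ownership per table once the clone exists; the table and SPN names here are placeholders:

```sql
-- Run by the cloner SPN after it has created/replaced the table,
-- so it keeps no long-lived ownership in the target catalog.
ALTER TABLE uat.sales.orders OWNER TO `uat-catalog-owner-spn`;
```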
Open Questions:
- Has anyone tried this approach, and will it work as expected?
- The main concern is access rights to the tables and the underlying data files when using a shallow clone; it might be simpler to use CTAS (CREATE TABLE AS SELECT) instead.
Please share your thoughts and experiences.
Thank you!
1 REPLY
12-05-2024 10:51 PM
Using CTAS (CREATE TABLE AS SELECT) might be a more robust solution for your use case:
- Independence: CTAS creates a new, independent copy of the data, avoiding dependencies on the source table.
- Simplified access control: Access rights can be managed solely within the target environment.
- Flexibility: You can easily modify the table structure or apply transformations during the copy process.
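A minimal CTAS sketch with placeholder names (if the target has to remain an external table, a LOCATION clause pointing at the target environment's storage would be added):

```sql
-- Full, independent copy: the target no longer depends on PROD data files.
CREATE OR REPLACE TABLE uat.sales.orders
AS SELECT * FROM prod.sales.orders;
```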
Optimizing the Cloning Process
- Use Delta Lake's CLONE command for efficient copying when appropriate.
- Implement incremental updates to minimize data transfer and processing time for subsequent refreshes (a MERGE sketch follows this list).
- Consider using Databricks Workflows to automate and schedule the cloning process.
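For those incremental refreshes, a MERGE against the PROD table is one option; `order_id` stands in for whatever key the table actually has:

```sql
-- Incremental refresh: only new or changed rows are written to the target.
MERGE INTO uat.sales.orders AS target
USING prod.sales.orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```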
Addressing Open Questions
- The proposed approach is viable, but consider using CTAS instead of shallow clones for better isolation and simpler access management.
- Access rights to underlying data files are indeed a concern with shallow clones. CTAS avoids this issue by creating independent copies.

