Hi Everyone,
I'd like to start a discussion about the best practices for copying data between environments. Here's a typical setup:
Environment Setup:
- The same region and metastore (Unity Catalog) is used across environments.
- Each environment has a single catalog.
- Data needs to be copied from PROD to UAT or SIT to enable testing in a non-empty environment.
- Role assignments are managed at the catalog/schema level, so there's no need to worry about access controls.
- Only one catalog is visible per workspace, based on workspace/catalog assignments (in most cases).
- All tables are owned by a dedicated service principal in every environment.
- Only the current version of the data is needed; data history is not required.
- All tables are external Delta tables (still not convinced about managed tables, sorry Databricks!).
Requirements and Considerations:
- Copy or clone tables from a higher environment (e.g., PROD or Staging) to a lower environment (e.g., UAT, SIT, DEV).
- Minimize effort and cost.
- A shallow clone might be sufficient, as it creates a new table definition that points to the source table files, effectively achieving a "zero-copy."
Key Questions:
- How can we implement this across different environments on the platform?
- How should ownership and access rights be designed to ensure proper isolation?
- Users prefer to see only the catalog assigned to a given workspace (showing more could cause confusion).
- Service principals (SPNs) owning catalogs/schemas should not have access to multiple catalogs to maintain isolation.
Proposed Approach:
- Create a dedicated "cloning workspace" with read-only access to all catalogs, including PROD.
- Restrict access to this workspace to high-level admins and specific service principals only.
- Establish a "cloner SPN" that has read access to PROD and modify/create table permissions in all other environments.
- Run an automated process in the "cloning workspace" using the "cloner SPN" to clone/create/replace tables in the target catalog while pointing to the source files.
- Change the table ownership from the "cloner SPN" to the designated "catalog owner SPN."
Open Questions:
- Has anyone tried this approach, and will it work as expected?
- The main concern is around access rights to tables and underlying data files when using a shallow clone. It might be easier to use CTAS (CREATE TABLE AS SELECT) instead.
Please share your thoughts and experiences.
Thank you!