
Best Practices for Copying Data Between Environments

KrzysztofPrzyso
New Contributor III

Hi Everyone,

I'd like to start a discussion about the best practices for copying data between environments. Here's a typical setup:

Environment Setup:

  • The same region and metastore (Unity Catalog) are used across environments.
  • Each environment has a single catalog.
  • Data needs to be copied from PROD to UAT or SIT to enable testing in a non-empty environment.
  • Role assignments are managed at the catalog/schema level, so there's no need to worry about access controls.
  • Only one catalog is visible per workspace, based on workspace/catalog assignments (in most cases).
  • All tables are owned by a dedicated service principal in every environment.
  • Only the current version of the data is needed; data history is not required.
  • All tables are external Delta tables (still not convinced about managed tables, sorry Databricks!).

Requirements and Considerations:

  • Copy or clone tables from a higher environment (e.g., PROD or Staging) to a lower environment (e.g., UAT, SIT, DEV).
  • Minimize effort and cost.
  • A shallow clone might be sufficient, as it creates a new table definition that points to the source table's data files, effectively achieving a "zero-copy" (see the sketch after this list).
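
For illustration, here is a minimal sketch of what such a shallow clone could look like, run as PySpark SQL from a Databricks notebook. The catalog, schema, and table names (prod, uat, sales, orders) are placeholders, and it assumes shallow clones across Unity Catalog catalogs are supported for your table type and runtime.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; this line only matters
# when running the snippet elsewhere.
spark = SparkSession.builder.getOrCreate()

# A shallow clone creates new table metadata in the target catalog while the
# data files stay where the source table wrote them ("zero-copy").
spark.sql("""
    CREATE OR REPLACE TABLE uat.sales.orders
    SHALLOW CLONE prod.sales.orders
""")
```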

Key Questions and Constraints:

  1. How can we implement this across different environments on the platform?
  2. How should ownership and access rights be designed to ensure proper isolation?
  3. Users prefer to see only the catalog assigned to a given workspace (showing more could cause confusion).
  4. Service principals (SPNs) owning catalogs/schemas should not have access to multiple catalogs, to maintain isolation (one possible grant layout is sketched below).
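
To make question 2 and the isolation constraints more concrete, one possible grant layout for a dedicated cloning principal might look like the following. The catalog names and the `cloner-spn` principal are hypothetical, not part of the setup described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read-only access on the source catalog: enough to resolve and read tables.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `cloner-spn`")

# Write access only on the target catalog, so the principal cannot modify PROD.
spark.sql(
    "GRANT USE CATALOG, USE SCHEMA, CREATE SCHEMA, CREATE TABLE, MODIFY "
    "ON CATALOG uat TO `cloner-spn`"
)
```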

Proposed Approach:

  • Create a dedicated "cloning workspace" with read-only access to all catalogs, including PROD.
  • Restrict access to this workspace to high-level admins and specific service principals only.
  • Establish a "cloner SPN" that has read access to PROD and modify/create table permissions in all other environments.
  • Run an automated process in the "cloning workspace" using the "cloner SPN" to clone/create/replace tables in the target catalog while pointing to the source files.
  • Change the table ownership from the "cloner SPN" to the designated "catalog owner SPN" (a sketch of this process follows the list).
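
A rough sketch of that automated process, assuming it runs as the cloner SPN in the cloning workspace. The catalog names, the use of information_schema to enumerate tables, and the owner principal name are illustrative assumptions, not a confirmed design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

SOURCE_CATALOG = "prod"                       # higher environment
TARGET_CATALOG = "uat"                        # lower environment
CATALOG_OWNER_SPN = "uat-catalog-owner-spn"   # placeholder principal name

# Enumerate tables in the source catalog via information_schema, skipping views.
tables = spark.sql(f"""
    SELECT table_schema, table_name
    FROM {SOURCE_CATALOG}.information_schema.tables
    WHERE table_schema <> 'information_schema'
      AND table_type IN ('MANAGED', 'EXTERNAL')
""").collect()

for row in tables:
    source = f"{SOURCE_CATALOG}.{row.table_schema}.{row.table_name}"
    target = f"{TARGET_CATALOG}.{row.table_schema}.{row.table_name}"

    # Make sure the target schema exists before cloning into it.
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {TARGET_CATALOG}.{row.table_schema}")

    # Shallow clone: only metadata is written to the target catalog; the data
    # files remain the source table's files.
    spark.sql(f"CREATE OR REPLACE TABLE {target} SHALLOW CLONE {source}")

    # Hand ownership from the cloner SPN to the designated catalog owner.
    spark.sql(f"ALTER TABLE {target} OWNER TO `{CATALOG_OWNER_SPN}`")
```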

Open Questions:

  • Has anyone tried this approach, and will it work as expected?
  • The main concern is access rights to the tables and underlying data files when using a shallow clone across catalogs; it might be simpler to use CTAS (CREATE TABLE AS SELECT) instead (a CTAS alternative is sketched below).
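
If the file-access concern with shallow clones turns out to be a blocker, the CTAS fallback could look like this (same placeholder names as above). It trades the zero-copy property for a full physical copy that no longer depends on the source table's files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CTAS writes a new, independent copy of the data into the target catalog's
# storage, so the lower environment no longer reads the PROD files.
spark.sql("""
    CREATE OR REPLACE TABLE uat.sales.orders
    AS SELECT * FROM prod.sales.orders
""")
```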

Please share your thoughts and experiences.

Thank you!
