<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best Practices for Copying Data Between Environments in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-copying-data-between-environments/m-p/101174#M40574</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Using CTAS (CREATE TABLE AS SELECT) might be a more robust solution for your use case:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Independence: CTAS creates a new, independent copy of the data, avoiding dependencies on the source table&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Simplified access control: Access rights can be managed solely within the target environment.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Flexibility: You can easily modify the table structure or apply transformations during the copy process.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 class="mb-2 mt-6 text-lg first:mt-3"&gt;Optimizing the Cloning Process&lt;/H2&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Use Delta Lake's&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;CODE&gt;CLONE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;command for efficient copying when appropriate&lt;/SPAN&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Implement incremental updates to minimize data transfer and processing time for subsequent refreshes.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Consider using Databricks Workflows to automate and schedule the cloning process&lt;/SPAN&gt;
&lt;DIV class="inline-flex h-[1rem] min-w-[1rem] items-center justify-center rounded-full px-[0.3em] text-center font-mono text-[0.60rem] tabular-nums md:hover:text-white border-borderMain/50 ring-borderMain/50 divide-borderMain/50 dark:divide-borderMainDark/50  dark:ring-borderMainDark/50 dark:border-borderMainDark/50 transition duration-300 bg-offsetPlus dark:bg-offsetPlusDark md:hover:bg-super"&gt;7&lt;/DIV&gt;
&lt;SPAN&gt;&lt;SPAN class="whitespace-nowrap"&gt;.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 class="mb-2 mt-6 text-lg first:mt-3"&gt;Addressing Open Questions&lt;/H2&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;The proposed approach is viable, but consider using CTAS instead of shallow clones for better isolation and simpler access management.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Access rights to underlying data files are indeed a concern with shallow clones. CTAS avoids this issue by creating independent copies.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Fri, 06 Dec 2024 06:51:11 GMT</pubDate>
    <dc:creator>Sidhant07</dc:creator>
    <dc:date>2024-12-06T06:51:11Z</dc:date>
    <item>
      <title>Best Practices for Copying Data Between Environments</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-copying-data-between-environments/m-p/88765#M37609</link>
      <description>&lt;P&gt;Hi Everyone,&lt;/P&gt;&lt;P&gt;I'd like to start a discussion about the best practices for copying data between environments. Here's a typical setup:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Environment Setup:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The same region and metastore (Unity Catalog) is used across environments.&lt;/LI&gt;&lt;LI&gt;Each environment has a single catalog.&lt;/LI&gt;&lt;LI&gt;Data needs to be copied from PROD to UAT or SIT to enable testing in a non-empty environment.&lt;/LI&gt;&lt;LI&gt;Role assignments are managed at the catalog/schema level, so there's no need to worry about access controls.&lt;/LI&gt;&lt;LI&gt;Only one catalog is visible per workspace, based on workspace/catalog assignments (in most cases).&lt;/LI&gt;&lt;LI&gt;All tables are owned by a dedicated service principal in every environment.&lt;/LI&gt;&lt;LI&gt;Only the current version of the data is needed; data history is not required.&lt;/LI&gt;&lt;LI&gt;All tables are external Delta tables (still not convinced about managed tables, sorry Databricks!).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Requirements and Considerations:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Copy or clone tables from a higher environment (e.g., PROD or Staging) to a lower environment (e.g., UAT, SIT, DEV).&lt;/LI&gt;&lt;LI&gt;Minimize effort and cost.&lt;/LI&gt;&lt;LI&gt;A shallow clone might be sufficient, as it creates a new table definition that points to the source table files, effectively achieving a "zero-copy."&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Key Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;How can we implement this across different environments on the platform?&lt;/LI&gt;&lt;LI&gt;How should ownership and access rights be designed to ensure proper isolation?&lt;/LI&gt;&lt;LI&gt;Users prefer to see only the catalog assigned to a given workspace (showing more could cause confusion).&lt;/LI&gt;&lt;LI&gt;Service principals (SPNs) owning catalogs/schemas should not have access to multiple catalogs to maintain isolation.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;STRONG&gt;Proposed Approach:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create a dedicated "cloning workspace" with read-only access to all catalogs, including PROD.&lt;/LI&gt;&lt;LI&gt;Restrict access to this workspace to high-level admins and specific service principals only.&lt;/LI&gt;&lt;LI&gt;Establish a "cloner SPN" that has read access to PROD and modify/create table permissions in all other environments.&lt;/LI&gt;&lt;LI&gt;Run an automated process in the "cloning workspace" using the "cloner SPN" to clone/create/replace tables in the target catalog while pointing to the source files.&lt;/LI&gt;&lt;LI&gt;Change the table ownership from the "cloner SPN" to the designated "catalog owner SPN."&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Open Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Has anyone tried this approach, and will it work as expected?&lt;/LI&gt;&lt;LI&gt;The main concern is around access rights to tables and underlying data files when using a shallow clone. It might be easier to use CTAS (CREATE TABLE AS SELECT) instead.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Please share your thoughts and experiences.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 16:14:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-copying-data-between-environments/m-p/88765#M37609</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-09-05T16:14:18Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practices for Copying Data Between Environments</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-copying-data-between-environments/m-p/101174#M40574</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Using CTAS (CREATE TABLE AS SELECT) might be a more robust solution for your use case:&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Independence: CTAS creates a new, independent copy of the data, avoiding dependencies on the source table&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Simplified access control: Access rights can be managed solely within the target environment.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Flexibility: You can easily modify the table structure or apply transformations during the copy process.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 class="mb-2 mt-6 text-lg first:mt-3"&gt;Optimizing the Cloning Process&lt;/H2&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;Use Delta Lake's&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;CODE&gt;CLONE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;command for efficient copying when appropriate&lt;/SPAN&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Implement incremental updates to minimize data transfer and processing time for subsequent refreshes.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Consider using Databricks Workflows to automate and schedule the cloning process&lt;/SPAN&gt;
&lt;DIV class="inline-flex h-[1rem] min-w-[1rem] items-center justify-center rounded-full px-[0.3em] text-center font-mono text-[0.60rem] tabular-nums md:hover:text-white border-borderMain/50 ring-borderMain/50 divide-borderMain/50 dark:divide-borderMainDark/50  dark:ring-borderMainDark/50 dark:border-borderMainDark/50 transition duration-300 bg-offsetPlus dark:bg-offsetPlusDark md:hover:bg-super"&gt;7&lt;/DIV&gt;
&lt;SPAN&gt;&lt;SPAN class="whitespace-nowrap"&gt;.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 class="mb-2 mt-6 text-lg first:mt-3"&gt;Addressing Open Questions&lt;/H2&gt;
&lt;OL class="marker:text-textOff list-decimal pl-8"&gt;
&lt;LI&gt;&lt;SPAN&gt;The proposed approach is viable, but consider using CTAS instead of shallow clones for better isolation and simpler access management.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Access rights to underlying data files are indeed a concern with shallow clones. CTAS avoids this issue by creating independent copies.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Fri, 06 Dec 2024 06:51:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-copying-data-between-environments/m-p/101174#M40574</guid>
      <dc:creator>Sidhant07</dc:creator>
      <dc:date>2024-12-06T06:51:11Z</dc:date>
    </item>
  </channel>
</rss>

