Wednesday
Hi,
For DR purposes, we have set up a deep clone using Delta Sharing. Each time the deep clone job runs, it executes the query
Wednesday
Hi,
Deep Clone is incremental. This means that any consecutive DEEP CLONE will copy only new data files.
Despite the CREATE OR REPLACE syntax looking like a full overwrite, Delta Lake's DEEP CLONE tracks the Delta log (transaction history) of the source table, not just the data files. Specifically, it records the last cloned version of the source table in the clone's own Delta log.
1st run (full copy):
Subsequent runs:
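Both runs can use the same statement; a minimal sketch, where the catalog/schema/table names are placeholders for your own:

```sql
-- First run: copies all data files (full physical copy).
-- Subsequent runs of the same statement: copies only files added or
-- changed since the last cloned source version.
CREATE OR REPLACE TABLE dr_catalog.dr_schema.orders
DEEP CLONE prod_catalog.prod_schema.orders;
```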
You can check the following article for details:
https://pl.seequality.net/power-clone-functionality-databricks-delta-tables/
If my answer was helpful, please consider marking it as the accepted solution.
Wednesday
Hi @DineshOjha,
Deep clone is incremental, not a full re-copy every time, even when you use CREATE OR REPLACE TABLE … DEEP CLONE … against a Delta Sharing table.
On the first DEEP CLONE, Databricks must read the entire source table (via Delta Sharing) and copy all data files and metadata into a brand-new Delta table at the target location. This is effectively a full physical copy, so the runtime is proportional to the full table size (plus any cross-region / cross-cloud egress).
On subsequent runs of CREATE OR REPLACE TABLE target DEEP CLONE source, the target is already a deep clone, with history that records which source version was last cloned. DEEP CLONE compares the current source version to the version recorded in the target's history, copies only new or changed data files from the source, and writes a new commit to the target's Delta log that references the previously copied files plus any newly copied ones. The commit is incremental, not a full rewrite of all files, so later runs only move the delta since the last clone, which is why they complete much faster.
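You can verify this incremental behaviour yourself from the target's Delta history. A sketch, assuming placeholder table names; the exact operationMetrics keys can vary by Databricks Runtime version, but CLONE commits typically report copied-file counts and sizes:

```sql
-- Each clone run appears as a CLONE operation in the target's history.
-- Compare operationMetrics (e.g. numCopiedFiles, copiedFilesSize) across
-- runs: after the first full copy, they should shrink to only the delta.
DESCRIBE HISTORY dr_catalog.dr_schema.orders;
```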
This behaviour is documented as shown in the snapshot, and the recommended pattern is exactly what you're using.
To answer the second part of your question about how it works with Delta Sharing: from the recipient workspace's point of view, once the share is mounted in a catalog, the shared table is just another Delta table that happens to read its files via the Delta Sharing protocol.
DEEP CLONE shared_table -> local_table uses Delta Sharing (signed URLs or cloud tokens) to read the source table's data files, copies those files into the target's storage, and creates a fully independent Delta table (the DR copy). On subsequent DEEP CLONE runs to the same target, the same incremental logic applies: only files new or changed since the last cloned source version are copied, so the job time tracks the size of the changes, not the whole table.
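Concretely, the DR clone from a shared table looks the same as any other deep clone; a sketch, assuming the share has been mounted as a catalog and all names here are placeholders:

```sql
-- 'shared_catalog' is the catalog created from the Delta Sharing share in
-- the recipient (DR) workspace. The target becomes a fully independent
-- Delta table in local storage, with no runtime dependency on the share.
CREATE OR REPLACE TABLE local_catalog.dr_schema.orders
DEEP CLONE shared_catalog.shared_schema.orders;
```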
While researching this for you, I found a blog post that I think is still relevant. It may not cover recent improvements, as it is from 2021, but the visuals can help you understand how cloning works.
Hope that helps.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.