Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
How Deep clone works

DineshOjha
New Contributor III

Hi,

For DR purposes, we have set up a deep clone using Delta Sharing. Each time the deep clone job runs, it executes the query

create or replace table {schema}.{table} deep clone {delta_share}.{schema}.{table}

The first time the job ran, it took a few hours to complete, but it has been completing in about 15 minutes on subsequent runs.
From my understanding, deep clone replaces the entire table each time, so why did the first run take a few hours while later runs finish so much faster?
Can someone please help me understand how deep clone works with Delta Sharing?
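For context, the DR job presumably runs one such statement per table. A minimal sketch of how the template expands (the share/schema/table names are made up; in a Databricks notebook each generated statement would be passed to spark.sql):

```python
def build_deep_clone_sql(delta_share: str, schema: str, table: str) -> str:
    # Assemble the same statement the job template expands to.
    return (
        f"CREATE OR REPLACE TABLE {schema}.{table} "
        f"DEEP CLONE {delta_share}.{schema}.{table}"
    )

# Hypothetical names; in a notebook you would run spark.sql(stmt) for each.
for table in ["orders", "customers"]:
    stmt = build_deep_clone_sql("dr_share", "sales", table)
    print(stmt)
```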
2 ACCEPTED SOLUTIONS


szymon_dybczak
Esteemed Contributor III

Hi @DineshOjha,

Deep clone is incremental: each consecutive DEEP CLONE run copies only new data files.

Despite the CREATE OR REPLACE syntax looking like a full overwrite, Delta Lake's DEEP CLONE tracks the Delta log (transaction history) of the source table, not just the data files. Specifically, it records the last cloned version of the source table in the clone's own Delta log.
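A toy model of that bookkeeping (illustrative only; the real Delta log format and commit metadata differ):

```python
clone_log = []

def record_clone_commit(source_version: int) -> None:
    # Each clone run appends a commit noting the source version it copied.
    clone_log.append({"operation": "CLONE", "sourceVersion": source_version})

def last_cloned_version():
    # The next run reads the most recent CLONE commit to find its baseline.
    clones = [c for c in clone_log if c["operation"] == "CLONE"]
    return clones[-1]["sourceVersion"] if clones else None

record_clone_commit(10)  # first run: cloned the source at version 10
record_clone_commit(14)  # second run: caught up to source version 14
print(last_cloned_version())
```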

1st run (full copy):

  • No previous clone metadata exists
  • Databricks must copy all Parquet data files from the Delta Share source to the target location
  • Also copies the full Delta transaction log
  • Time is proportional to total table size -> hence hours

Subsequent runs:

  • Databricks reads the clone's Delta log to determine the last successfully cloned version
  • It then asks the Delta Share source: "What changed since version X?"
  • Only new or modified files (added/updated/deleted since that version) are physically copied
  • Unchanged files are referenced by the new snapshot without being re-copied
  • Time is proportional to the delta (change volume) since last run -> hence ~15 mins
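The two runs above can be sketched as a simple file-set diff (illustrative only; real clones operate on Delta log actions, not bare filenames):

```python
def files_to_copy(files_at_last_clone: set, files_at_current_version: set) -> set:
    # Only files added since the last cloned version are physically copied;
    # files removed at the source simply stop being referenced.
    return files_at_current_version - files_at_last_clone

v10 = {"part-000.parquet", "part-001.parquet"}  # source at last cloned version
v14 = {"part-000.parquet", "part-002.parquet"}  # source now: 001 gone, 002 new
print(sorted(files_to_copy(v10, v14)))
```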

You can check the following article for details:

https://pl.seequality.net/power-clone-functionality-databricks-delta-tables/

If my answer was helpful, please consider marking it as the accepted solution.


Ashwin_DSA
Databricks Employee

Hi @DineshOjha,

Deep clone is incremental, not a full re-copy every time, even when you use CREATE OR REPLACE TABLE … DEEP CLONE … against a Delta Sharing table.

On the first DEEP CLONE, Databricks must read the entire source table (via Delta Sharing) and copy all data files plus metadata into a brand-new Delta table at the target location. This is effectively a full physical copy, so the runtime is proportional to the full table size (and any cross-region / cross-cloud egress).

On subsequent runs of CREATE OR REPLACE TABLE target DEEP CLONE source, the target is already a deep clone, with history that records which source version was last cloned. DEEP CLONE compares the current source version to the version recorded in the target's history, then copies only new/changed data files from the source, and also updates the target's Delta log with a new commit that references the old files and any newly copied ones. The commit is incremental, not a full rewrite of all files. So your later runs only move the delta since the last clone, which is why they complete much faster.
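That incremental commit can be modelled in a few lines (a sketch, not the real clone implementation): the target physically copies only what it lacks, while the new snapshot still references every file at the current source version.

```python
def plan_incremental_commit(target_files: set, source_files_now: set):
    # Copy only files the target does not already hold; the new snapshot
    # nevertheless references every file at the current source version.
    to_copy = source_files_now - target_files
    snapshot = set(source_files_now)
    return to_copy, snapshot

to_copy, snapshot = plan_incremental_commit(
    target_files={"part-000", "part-001"},
    source_files_now={"part-000", "part-001", "part-002"},
)
print(sorted(to_copy))    # only the new file moves
print(sorted(snapshot))   # snapshot references all three files
```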

This behaviour is documented, and the recommended pattern is exactly what you're using. [screenshot of the documentation omitted]

As for the second part of your question, about how it works with Delta Sharing: from the recipient workspace's point of view, once the share is mounted in a catalog, the shared table is just another Delta table that happens to read its files via the Delta Sharing protocol.

DEEP CLONE shared_table -> local_table uses Delta Sharing (signed URLs or cloud tokens) to read the source table's data files, copies those files into the target's storage, and creates a fully independent Delta table (the DR copy). On subsequent DEEP CLONE runs to the same target, the same incremental logic applies: only new/changed files since the last cloned source version are copied, so the job time tracks the size of the changes, not the whole table.

While researching this for you, I found this blog, which I think is still relevant. It may not cover the most recent improvements since it is from 2021, but the visuals can help you understand the workings.

Hope that helps.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

