szymon_dybczak
Esteemed Contributor III

Hi,

Deep Clone is incremental. This means that each subsequent DEEP CLONE into the same target copies only the data files that are new since the previous clone.

Despite the CREATE OR REPLACE syntax looking like a full overwrite, Delta Lake's DEEP CLONE tracks the Delta log (transaction history) of the source table, not just the data files. Specifically, it records the last cloned version of the source table in the clone's own Delta log.
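For reference, here is a minimal sketch of the statement being discussed, built as a string in Python. The table names are hypothetical placeholders and `deep_clone_sql` is just an illustrative helper, not a Databricks API; the `spark.sql` call is commented out because it assumes a Databricks session where `spark` is predefined.

```python
def deep_clone_sql(target: str, source: str) -> str:
    """Build the CREATE OR REPLACE ... DEEP CLONE statement for two table names."""
    return f"CREATE OR REPLACE TABLE {target} DEEP CLONE {source}"


# Hypothetical table names, for illustration only.
stmt = deep_clone_sql("main.backup.orders", "shared_catalog.sales.orders")
print(stmt)

# On Databricks, the same statement would be re-run on every refresh:
# spark.sql(stmt)
```

The point is that the statement is identical on the first run and on every later run; the incremental behaviour comes from the clone metadata in the target, not from different syntax.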

1st run (full copy):

  • No previous clone metadata exists
  • Databricks must copy all Parquet data files from the Delta Sharing source to the target location
  • Also copies the source table's metadata (schema, partitioning, table properties); the clone keeps its own, independent transaction history
  • Time is proportional to total table size -> hence hours

Subsequent runs:

  • Databricks reads the clone's Delta log to determine the last successfully cloned version
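  • It then asks the Delta Sharing source: "What changed since version X?"
  • Only data files added since that version (including rewrites produced by updates and deletes) are physically copied; files the source no longer references are dropped from the clone's new snapshot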
  • Unchanged files are referenced by the new snapshot without being re-copied
  • Time is proportional to the delta (change volume) since last run -> hence ~15 mins
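To make the difference between the first and later runs concrete, here is a toy model in plain Python. It is only a sketch of the idea (comparing the source's live file set with what the clone already holds), not how Databricks implements it internally; `ClonePlan`, `plan_incremental_clone` and the file names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClonePlan:
    to_copy: frozenset    # files live in the source but not yet in the clone
    to_remove: frozenset  # files in the clone that the source no longer references
    unchanged: frozenset  # files already in the clone and still live in the source


def plan_incremental_clone(source_files, clone_files) -> ClonePlan:
    """Compare the source's live files with the files the clone already has."""
    src, dst = frozenset(source_files), frozenset(clone_files)
    return ClonePlan(to_copy=src - dst, to_remove=dst - src, unchanged=src & dst)


# 1st run: the clone is empty, so every file has to be copied.
first = plan_incremental_clone({"a.parquet", "b.parquet", "c.parquet"}, set())

# Later run: the source rewrote b.parquet into d.parquet since the last clone.
second = plan_incremental_clone(
    {"a.parquet", "c.parquet", "d.parquet"},
    {"a.parquet", "b.parquet", "c.parquet"},
)

print(sorted(first.to_copy))
print(sorted(second.to_copy), sorted(second.to_remove), sorted(second.unchanged))
```

In this model the second run copies one file instead of three, which is why the run time follows the change volume rather than the table size. To check what a real run did, as far as I know DESCRIBE HISTORY on the target table lists the CLONE operations with metrics such as numCopiedFiles and copiedFilesSize.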

You can check the following article for details:

https://pl.seequality.net/power-clone-functionality-databricks-delta-tables/

If my answer was helpful, please consider marking it as the accepted solution.
