Delta Tables incremental backup method

SRS
New Contributor II

Hello,

Has anyone tried to create an incremental backup of Delta tables? What I mean is to copy into the backup storage only the latest parquet files that are part of the Delta table and to refresh the _delta_log folder, instead of copying all the files again and again.

The principle I base this method on is that when new data is added to the Delta table, new parquet files are written. So it should be possible to copy only those new files. Is it possible for a parquet file to be changed after its creation?

I am curious whether someone else has tried this, whether you think it is a valid idea, and how it would compare with Deep Clone in terms of speed and resources spent?
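Roughly, the idea would look like this (a sketch only: the paths, the tracked last_backed_up_version, and the copy logic are assumptions, and it ignores checkpoint files and files rewritten or removed by OPTIMIZE/VACUUM):

%python
import json

source = "/mnt/prod/my_table"        # hypothetical source table path
backup = "/mnt/backup/my_table"      # hypothetical backup location
last_backed_up_version = 41          # hypothetical, persisted by the backup job

for f in dbutils.fs.ls(f"{source}/_delta_log/"):
    if not f.name.endswith(".json"):
        continue                     # skip checkpoints and _last_checkpoint
    version = int(f.name.split(".")[0])
    if version <= last_backed_up_version:
        continue                     # this commit is already backed up

    # copy the new parquet files referenced by this commit
    for row in spark.read.text(f.path).collect():
        action = json.loads(row.value)
        if "add" in action:
            dbutils.fs.cp(f"{source}/{action['add']['path']}",
                          f"{backup}/{action['add']['path']}")

    # refresh the _delta_log folder with the new commit file
    dbutils.fs.cp(f.path, f"{backup}/_delta_log/{f.name}")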

-werners-
Esteemed Contributor III

I'm gonna answer this with a question 🙂

How are you going to rebuild the latest state of the delta lake table?

Hubert-Dudek
Databricks MVP
  • copy your Delta table to a new location (ideally ADLS/Blob Storage in another region)
CREATE OR REPLACE TABLE shared_table CLONE my_prod_table;

  • vacuum all history in new location
%sql
VACUUM delta.`<path-to-table>` RETAIN 0 HOURS
  • remove <path-to-table>/_delta_log in the new location (a combined sketch of these steps follows below)
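
Putting the three steps together in one notebook cell could look roughly like this (a sketch only: the backup path and table name are assumptions, and RETAIN 0 HOURS needs the retention safety check disabled first):

%python
backup_path = "abfss://backups@myaccount.dfs.core.windows.net/my_prod_table"  # assumed target

# 1. deep clone the production table to the backup location
spark.sql(f"CREATE OR REPLACE TABLE delta.`{backup_path}` CLONE my_prod_table")

# 2. vacuum away all history in the backup copy
#    (RETAIN 0 HOURS requires disabling the retention duration safety check)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql(f"VACUUM delta.`{backup_path}` RETAIN 0 HOURS")
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

# 3. remove the transaction log so the backup is left as plain parquet files
dbutils.fs.rm(f"{backup_path}/_delta_log", True)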

My blog: https://databrickster.medium.com/

jose_gonzalez
Databricks Employee

Hi @Stefan Stegaru,

You can use Delta time travel to query the data that was added in a specific version. Then, as @Hubert Dudek mentioned, you can copy this subset of data to a new table or a new location. You will need to do a deep clone to copy the data from the source. Docs here
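
For illustration (a sketch only; the table name, version numbers, and backup path are assumptions):

%python
# rows added between two versions, using time travel (versions 41 and 42 are hypothetical)
added = spark.sql("""
  SELECT * FROM my_prod_table VERSION AS OF 42
  EXCEPT
  SELECT * FROM my_prod_table VERSION AS OF 41
""")

# deep clone a specific version of the source table to a backup location
spark.sql("""
  CREATE OR REPLACE TABLE delta.`/mnt/backup/my_prod_table_v42`
  DEEP CLONE my_prod_table VERSION AS OF 42
""")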
