Data Engineering
Delta Tables incremental backup method

SRS
New Contributor II

Hello,

Has anyone tried to create an incremental backup of Delta tables? What I mean is copying to the backup storage only the latest parquet files that are part of the Delta table and refreshing the _delta_log folder, instead of copying all the files again and again.

The principle this method is based on is that when new data is added to a Delta table, a new parquet file is written. So it should be possible to copy only those new files. Can a parquet file be changed after it is created?
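To make the idea concrete: data files in a Delta table are not rewritten in place. Updates and deletes write new files and record `add` and `remove` actions in the JSON commit files under _delta_log. Below is a minimal sketch of how the files changed since the last backup could be listed from those commits. The function name `files_changed_since` and the "last backed-up version" bookkeeping are hypothetical, and the sketch assumes the table sits on a filesystem that `pathlib` can read (cloud storage would need its own listing calls) and that the commits since the last backup have not yet been cleaned up from the log.

```python
import json
from pathlib import Path


def files_changed_since(table_path, last_backed_up_version):
    """Return (added, removed) data-file paths recorded in commits newer
    than last_backed_up_version, read from the JSON files in _delta_log.

    Hypothetical helper for illustration, not part of any Delta library.
    """
    log_dir = Path(table_path) / "_delta_log"
    added, removed = set(), set()
    # Commit files have zero-padded version names such as
    # 00000000000000000003.json, so a name sort is also a version sort.
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a plain commit file
        if int(commit.stem) <= last_backed_up_version:
            continue
        # Each line of a commit file is one JSON action.
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                path = action["add"]["path"]
                added.add(path)
                removed.discard(path)
            elif "remove" in action:
                path = action["remove"]["path"]
                removed.add(path)
                added.discard(path)
    return sorted(added), sorted(removed)
```

The `added` list is what an incremental copy would need to transfer along with the new _delta_log entries. The paths are relative to the table root as stored in the log, so they may need decoding before being used as file names.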

I am curious whether anyone else has tried this, whether you think it is a valid idea, and how it would compare with Deep Clone in terms of speed and resources spent.

1 ACCEPTED SOLUTION

jose_gonzalez
Databricks Employee

Hi @Stefan Stegaru​ ,

You can use Delta time travel to query the data that was added in a specific version. Then, as @Hubert Dudek mentioned, you can copy this subset of data to a new table or a new location. To copy the data files themselves from the source, you will need to do a deep clone. See the docs for details.
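As a rough illustration of the two pieces mentioned here, the sketch below builds a time travel query and a deep clone statement as SQL strings and hands the clone to whatever executes SQL in your environment (for example `spark.sql` in a notebook). The helper names and the `backup_table` target are hypothetical; `my_prod_table` is just the example name used elsewhere in this thread. As far as I understand the Databricks docs, re-running a deep clone against an existing target copies only what changed, which is worth verifying for your runtime version.

```python
def time_travel_query(table, version):
    """SQL that reads the table as it was at the given version."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def deep_clone_statement(source, target):
    """SQL that deep-clones source into target (data files plus metadata)."""
    return f"CREATE OR REPLACE TABLE {target} DEEP CLONE {source}"


def run_backup(execute, source, target):
    """Run the clone through any callable that accepts a SQL string,
    such as spark.sql. Returns the statement that was executed."""
    statement = deep_clone_statement(source, target)
    execute(statement)
    return statement


# Example with a stand-in executor that only prints the statement:
# run_backup(print, "my_prod_table", "backup_table")
```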


3 REPLIES 3

-werners-
Esteemed Contributor III

I'm gonna answer this with a question 🙂

How are you going to rebuild the latest state of the delta lake table?

Hubert-Dudek
Esteemed Contributor III
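  • Copy your Delta table to a new location (ideally ADLS/Blob Storage in another region):
%sql
CREATE OR REPLACE TABLE shared_table CLONE my_prod_table;

  • Vacuum all history in the new location (a 0-hour retention is rejected unless spark.databricks.delta.retentionDurationCheck.enabled is set to false):
%sql
VACUUM delta.`<path-to-table>` RETAIN 0 HOURS

  • Remove <path-to-table>/_delta_log in the new location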
