Delta Table - Reduce time travel storage size

elgeo
Valued Contributor II

Hello! I am trying to understand the time travel feature. I see with "DESCRIBE HISTORY" that all the transaction history on a specific table is recorded by version and timestamp. However, I understand that this occupies a lot of storage, especially if a table is updated every day. Is there a way to remove history or reduce the retention period? What are the minimum and maximum retention periods you can have with time travel? I tried the commands below, but "DESCRIBE HISTORY" didn't return different results.

ALTER TABLE table_name
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1 HOURS', 'delta.deletedFileRetentionDuration' = 'interval 1 HOURS');

VACUUM table_name;
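(To confirm the properties were actually applied, they can be listed afterwards; a minimal check, assuming the same table_name:)

-- List the table properties and verify the two retention settings appear.
SHOW TBLPROPERTIES table_name;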


-werners-
Esteemed Contributor III

You are almost there. From the help page:

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.

Also:

It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
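For illustration, turning off the check and then vacuuming with a short retention looks roughly like this (a sketch only; table_name and the 1-hour retention are placeholders, and the warnings above about concurrent readers and writers apply):

-- Session-level: disable the safety check that blocks retention
-- intervals shorter than the default 7 days (168 hours).
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove data files no longer referenced by versions newer than 1 hour.
VACUUM table_name RETAIN 1 HOURS;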

elgeo
Valued Contributor II

Thank you @Werner Stinckens for your reply. However, I still haven't managed to delete history, even after setting the below. The number of history rows remains the same when running "DESCRIBE HISTORY".

SET spark.databricks.delta.retentionDurationCheck.enabled = false

What I am actually trying to do is remove older history records from the Delta table history. Moreover, is there a minimum and maximum retention period you can have with time travel?

-werners-
Esteemed Contributor III (Accepted Solution)

You will have to define the retention interval when doing the vacuum:

VACUUM table_name [RETAIN num HOURS]

There is also a dry run option. You can go as low as 0 hours; that way all history is deleted. A maximum value I do not know; 30 days is certainly possible, but I never tested with more than that.
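Put together, a minimal sketch (table_name is a placeholder; the 0-hour retention assumes the safety check above is disabled and that nothing is concurrently reading or writing the table):

-- Preview which files would be deleted, without deleting anything.
VACUUM table_name RETAIN 0 HOURS DRY RUN;

-- Delete all data files not needed by the current table version.
VACUUM table_name RETAIN 0 HOURS;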
