
Delta Table - Reduce time travel storage size

elgeo
Valued Contributor II

Hello! I am trying to understand the time travel feature. I see with the DESCRIBE HISTORY command that the entire transaction history of a table is recorded by version and timestamp. However, I understand that this can occupy a lot of storage, especially if a table is updated every day. Is there a way to remove history or reduce the retention period? What are the minimum and maximum retention periods you can have with time travel? I tried the commands below, but DESCRIBE HISTORY didn't return different results.

ALTER TABLE table_name
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 1 HOURS',
  'delta.deletedFileRetentionDuration' = 'interval 1 HOURS'
);

VACUUM table_name;


3 REPLIES

-werners-
Esteemed Contributor III

You are almost there. From the help page:

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
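For example, a minimal sketch of disabling the check and vacuuming with a short retention (the table name my_table is hypothetical, and this permanently removes the ability to time travel past the retention window):

-- Allow retention intervals shorter than the default safety threshold.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Remove data files no longer referenced by versions newer than 1 hour.
VACUUM my_table RETAIN 1 HOURS;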

Also:

It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
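So a safer setup, as a sketch (table name hypothetical, 7 days = 168 hours), is to align the table properties with the vacuum retention:

ALTER TABLE my_table
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 7 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);

-- Keep 7 days of history; no safety-check override needed.
VACUUM my_table RETAIN 168 HOURS;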

elgeo
Valued Contributor II

Thank you @Werner Stinckens for your reply. However, I still haven't managed to delete history, even after setting the below. The number of history rows remains the same when running "DESCRIBE HISTORY".

SET spark.databricks.delta.retentionDurationCheck.enabled = false

What I am actually trying to do is remove older history records from the Delta table history. Moreover, is there a minimum and maximum retention period you can have with time travel?

-werners-
Esteemed Contributor III

You will have to define the retention interval when doing the vacuum:

VACUUM table_name [RETAIN num HOURS]

There is also a dry run option.

You can go down to 0 hours; that way all history is deleted. A maximum value I do not know; 30 days is certainly possible, but I have never tested more than that.
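As a hedged sketch of both forms (table name hypothetical):

-- Preview which files a 0-hour vacuum would delete, without deleting anything.
VACUUM my_table RETAIN 0 HOURS DRY RUN;

-- Actually delete them (requires the retention safety check disabled, as above).
VACUUM my_table RETAIN 0 HOURS;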
