
Delta Lake table: large volume due to versioning

abaschkim
New Contributor II

I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables - simply stored in the file system. So there are multiple writes per second. After running the pipeline for a while, I noticed that the tables require a large amount of storage on disk. Some tables require 10x storage compared to the sources.

I investigated the Delta Lake table versioning. When I run DESCRIBE DETAIL on a selected table, the reported sizeInBytes is around 10 GB, although the corresponding folder on disk takes up over 100 GB.

DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

So I set the following properties:

ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` 
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')

and then performed a VACUUM:

VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

But still, after several days and despite constantly performing a VACUUM, the size on disk stays at around 100 GB. How can I overcome this issue?
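For reference, the table properties can be double-checked with SHOW TBLPROPERTIES (assuming it accepts the path-based delta.`...` identifier, as DESCRIBE DETAIL does):

-- Confirm that the retention-related properties were actually applied to the table
SHOW TBLPROPERTIES delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`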

Thanks in advance!

4 REPLIES

-werners-
Esteemed Contributor III

Databricks sets the default safety interval to 7 days. You can go below that, as you are trying.

However, Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
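As a rough sketch (assuming the same path-based table as above), the check can be disabled at the session level and the retention window given explicitly on the VACUUM itself:

-- Disable the retention safety check for this session only; do this
-- only if no readers or writers need files older than the retention window
SET spark.databricks.delta.retentionDurationCheck.enabled = false

-- Vacuum with an explicit retention window (here 24 hours) instead of
-- relying solely on the table properties
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 24 HOURS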

abaschkim
New Contributor II

Thank you for your answer, werners.

Unfortunately, I had already set this in my Spark config. Before doing so, the VACUUM command threw the warning described in the documentation.

Now I get the following result back:

Deleted 0 files and directories in a total of 1 directories

But it should actually delete older versions, as there are versions older than a week.
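For what it's worth, Delta's DRY RUN option should list which files a VACUUM would remove without deleting anything; a minimal sketch:

-- Show the files that VACUUM would delete, without actually removing them
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 24 HOURS DRY RUN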

-werners-
Esteemed Contributor III

It seems the old files are orphaned.

Did you switch Databricks versions? Maybe the Delta Lake table was created with another version?
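To narrow it down, the table history may help; in recent Delta releases the engineInfo column shows which Spark/Delta version wrote each commit (a hedged sketch, same path as above):

-- Inspect the commit history; the operation and engineInfo columns can reveal
-- which Delta/Spark versions wrote the table and which maintenance operations ran
DESCRIBE HISTORY delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`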

Anonymous
Not applicable

Hey there @Kim Abasch

Hope all is well!

Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
