Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Lake table: large volume due to versioning

abaschkim
New Contributor II

I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka into multiple Delta Lake tables, stored directly on the file system, so there are multiple writes per second. After running the pipeline for a while, I noticed that the tables require a large amount of disk storage; some take up to 10x the storage of their sources.

I investigated Delta Lake table versioning. When I DESCRIBE DETAIL a selected table, the reported sizeInBytes is around 10 GB, although the corresponding folder on disk takes over 100 GB:

DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

So I set the following properties:

ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` 
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')

and then performed a VACUUM:

VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

But even after several days, and despite constantly running VACUUM, the size on disk stays at around 100 GB. How can I overcome this issue?
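One way to see what VACUUM is (or isn't) eligible to delete is a dry run. This is a sketch using the path from the post above; VACUUM's RETAIN and DRY RUN clauses are part of Delta Lake's SQL syntax, but the retention you choose here should match the safety considerations discussed in the replies:

```sql
-- Preview which files VACUUM would remove, without deleting anything.
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 1 HOURS DRY RUN
```

If the dry run lists zero candidate files, the excess storage is not held by tracked old versions, which points toward orphaned files or the retention safety check.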

Thanks in advance!

4 REPLIES

-werners-
Esteemed Contributor III

Databricks sets the default safety interval to 7 days. You can go below that, as you are trying.

However, Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
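As a sketch, the safety check can be disabled for the current SQL session and the vacuum rerun with an explicit retention below the 7-day default (the table path is the one from the original post; only do this if no concurrent readers or writers need the older files):

```sql
-- Disable the retention-duration safety check for this session.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Vacuum with an explicit retention shorter than the 7-day default.
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 1 HOURS
```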

abaschkim
New Contributor II

Thank you for your answer, werners.

Unfortunately, I had already set this in my Spark config. Before that, the VACUUM command threw a warning, as stated in the documentation.

Now I get the following result back:

Deleted 0 files and directories in a total of 1 directories

But it should actually be deleting older versions, as there are versions older than a week.
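One way to confirm that versions older than the retention window really exist is to inspect the table's commit history, which lists each version with its timestamp:

```sql
-- Show the commit history (version, timestamp, operation, ...) for the table.
DESCRIBE HISTORY delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
```

If the oldest timestamps shown here are all recent, the log has already been truncated and the leftover files on disk are no longer referenced by the table at all.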

-werners-
Esteemed Contributor III

It seems the old files are orphaned.

Did you switch Databricks versions? Maybe the Delta Lake table was created with another version?

Anonymous
Not applicable

Hey there @Kim Abasch

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
