
Delta Lake table: large volume due to versioning

abaschkim
New Contributor II

I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables - simply stored in the file system. So there are multiple writes per second. After running the pipeline for a while, I noticed that the tables require a large amount of storage on disk. Some tables require 10x storage compared to the sources.

I investigated the Delta Lake table versioning. When I run DESCRIBE DETAIL on a selected table, the reported sizeInBytes is around 10 GB, although the corresponding folder on disk takes up over 100 GB.

DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

So I set the following properties:

ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` 
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')

and then performed a VACUUM:

VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`

But still, after several days and despite constantly performing a VACUUM, the size on disk stays at around 100 GB. How can I overcome this issue?
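For reference, the table properties can be double-checked with SHOW TBLPROPERTIES (assuming it accepts the path-based delta.`...` identifier, as DESCRIBE DETAIL does):

-- Confirm that the retention-related properties were actually applied to the table
SHOW TBLPROPERTIES delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`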

Thanks in advance!

4 REPLIES

-werners-
Esteemed Contributor III

Databricks sets the default safety interval to 7 days. You can go below that, as you are trying.

However, Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
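As a rough sketch (assuming the same path-based table as above), the check can be disabled at the session level and the retention window given explicitly on the VACUUM itself:

-- Disable the retention safety check for this session only; do this
-- only if no readers or writers need files older than the retention window
SET spark.databricks.delta.retentionDurationCheck.enabled = false

-- Vacuum with an explicit retention window (here 24 hours) instead of
-- relying solely on the table properties
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 24 HOURS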

abaschkim
New Contributor II

Thank you for your answer, werners.

Unfortunately, I had already set this in my Spark config. Before doing so, the VACUUM command threw the warning described in the documentation.

Now I get the following result back:

Deleted 0 files and directories in a total of 1 directories

But it should actually delete older versions, as there are versions older than a week.
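For what it's worth, Delta's DRY RUN option should list which files a VACUUM would remove without deleting anything; a minimal sketch:

-- Show the files that VACUUM would delete, without actually removing them
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` RETAIN 24 HOURS DRY RUN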

-werners-
Esteemed Contributor III

It seems the old files are orphaned.

Did you switch Databricks versions? Maybe the Delta Lake table was created with another version?
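To narrow it down, the table history may help; in recent Delta releases the engineInfo column shows which Spark/Delta version wrote each commit (a hedged sketch, same path as above):

-- Inspect the commit history; the operation and engineInfo columns can reveal
-- which Delta/Spark versions wrote the table and which maintenance operations ran
DESCRIBE HISTORY delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`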

Anonymous
Not applicable

Hey there @Kim Abasch

Hope all is well!

Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
