I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables, which are simply stored on the file system (path-based tables). The tables receive multiple writes per second. After running the pipeline for a while, I noticed that the tables take up a large amount of storage on disk; some tables need roughly 10x the storage of their Kafka sources.
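The writes look roughly like this (a simplified sketch; the broker, topic name, and checkpoint path are placeholders and the schema parsing is omitted):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read from Kafka (placeholder broker and topic)
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "algod_indexer_public_txn_flat")
    .load()
)

# Append each micro-batch to a path-based Delta table; with the default trigger
# this commits several times per second
(
    stream.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/delta/checkpoints/algod_indexer_public_txn_flat")
    .start("/mnt/delta/bronze/algod_indexer_public_txn_flat")
)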
I investigated Delta Lake's table versioning. When I run DESCRIBE DETAIL on one of the tables, it reports a sizeInBytes of around 10 GB, although the corresponding folder on disk takes up over 100 GB:
DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
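For reference, this is roughly how I compare the two numbers (a sketch; I measure the folder size with du):

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/delta/bronze/algod_indexer_public_txn_flat"

# Size of the current table version as reported by Delta (~10 GB)
spark.sql(f"DESCRIBE DETAIL delta.`{path}`").select("sizeInBytes", "numFiles").show()

# Actual footprint of the folder on disk, including the _delta_log and data files
# that only belong to older versions (~100 GB)
print(subprocess.run(["du", "-sh", path], capture_output=True, text=True).stdout)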
So I set the following properties:
ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')
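To rule out a typo, I also checked that the properties were actually applied (they show up in the properties column of DESCRIBE DETAIL):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/delta/bronze/algod_indexer_public_txn_flat"

# Should list delta.logRetentionDuration and delta.deletedFileRetentionDuration
spark.sql(f"DESCRIBE DETAIL delta.`{path}`").select("properties").show(truncate=False)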
Then I performed a VACUUM:
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
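The VACUUM is triggered periodically from a small maintenance job, roughly like this (a sketch; the DRY RUN variant lists the files that would be removed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/delta/bronze/algod_indexer_public_txn_flat"

# Show which data files VACUUM considers removable under the current retention settings
spark.sql(f"VACUUM delta.`{path}` DRY RUN").show(truncate=False)

# Actually delete files that are no longer referenced by versions within the retention period
spark.sql(f"VACUUM delta.`{path}`")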
But even after several days of constantly performing VACUUM, the size on disk stays at around 100 GB. How can I overcome this issue?
Thanks in advance!