<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Lake table:  large volume due to versioning in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19241#M12877</link>
    <description>&lt;P&gt;It seems the old files are orphaned.&lt;/P&gt;&lt;P&gt;Did you switch Databricks runtime versions? Maybe the Delta Lake table was created on another version?&lt;/P&gt;</description>
    <pubDate>Tue, 31 May 2022 09:50:30 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-05-31T09:50:30Z</dc:date>
    <item>
      <title>Delta Lake table:  large volume due to versioning</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19238#M12874</link>
      <description>&lt;P&gt;I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables, stored directly in the file system. There are multiple writes per second. After running the pipeline for a while, I noticed that the tables require a large amount of storage on disk; some tables require 10x the storage of their sources.&lt;/P&gt;&lt;P&gt;I investigated the Delta Lake table versioning. When I&amp;nbsp;DESCRIBE a selected table, the reported&amp;nbsp;sizeInBytes&amp;nbsp;is actually only around 10&amp;nbsp;GB, although the corresponding folder on disk takes over 100&amp;nbsp;GB:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;So I set the following properties:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat` 
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;and then performed a VACUUM:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But even after several days of constantly running VACUUM, the size on disk stays at around 100&amp;nbsp;GB. How can I overcome this issue?&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 14:31:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19238#M12874</guid>
      <dc:creator>abaschkim</dc:creator>
      <dc:date>2022-05-30T14:31:29Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake table:  large volume due to versioning</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19239#M12875</link>
      <description>&lt;P&gt;Databricks sets the default safety interval to 7 days. You can go below that, as you are trying.&lt;/P&gt;&lt;P&gt;However, Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property&amp;nbsp;&lt;CODE&gt;spark.databricks.delta.retentionDurationCheck.enabled&lt;/CODE&gt;&amp;nbsp;to false.&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 14:56:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19239#M12875</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-05-30T14:56:27Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake table:  large volume due to versioning</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19240#M12876</link>
      <description>&lt;P&gt;Thank you for your answer, werners.&lt;/P&gt;&lt;P&gt;Unfortunately, I had already set this in my Spark config. Before that, the VACUUM command threw the warning described in the documentation.&lt;/P&gt;&lt;P&gt;Now I get the following result back:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Deleted 0 files and directories in a total of 1 directories&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But it should actually delete older versions, as there are versions older than a week.&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 15:23:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19240#M12876</guid>
      <dc:creator>abaschkim</dc:creator>
      <dc:date>2022-05-30T15:23:07Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake table:  large volume due to versioning</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19241#M12877</link>
      <description>&lt;P&gt;It seems the old files are orphaned.&lt;/P&gt;&lt;P&gt;Did you switch Databricks runtime versions? Maybe the Delta Lake table was created on another version?&lt;/P&gt;</description>
      <pubDate>Tue, 31 May 2022 09:50:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19241#M12877</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-05-31T09:50:30Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake table:  large volume due to versioning</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19242#M12878</link>
      <description>&lt;P&gt;Hey there @Kim Abasch&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope all is well!&lt;/P&gt;&lt;P&gt;Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2022 16:38:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-table-large-volume-due-to-versioning/m-p/19242#M12878</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-07-29T16:38:25Z</dc:date>
    </item>
  </channel>
</rss>

