<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Very long vacuum on s3 in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/117092#M45424</link>
    <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25768"&gt;@alonisser&lt;/a&gt;, on Azure and GCP, VACUUM performs file deletion in parallel on the driver&amp;nbsp;when using Databricks Runtime 10.4 LTS or above. The more driver cores there are, the more the operation can be parallelised. On AWS, however, deletion is single-threaded: VACUUM uses a bulk delete API that removes files in batches of 1000, but it doesn’t use parallel threads. As a result, a multi-core driver may not help on AWS.&lt;/P&gt;
&lt;P&gt;For best practices on VACUUM, please refer to&amp;nbsp;&lt;A href="https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake" target="_blank"&gt;https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake&lt;/A&gt;.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Apr 2025 07:28:20 GMT</pubDate>
    <dc:creator>iyashk-DB</dc:creator>
    <dc:date>2025-04-30T07:28:20Z</dc:date>
    <item>
      <title>Very long vacuum on s3</title>
      <link>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/116658#M45362</link>
      <description>&lt;P&gt;Since we moved from Azure to AWS, a specific job has had extremely long VACUUM runs.&lt;/P&gt;&lt;P&gt;Is there a specific flag or configuration for S3 storage that is needed to support faster VACUUM?&lt;/P&gt;&lt;P&gt;How can I research what's going on?&lt;/P&gt;&lt;P&gt;Note: it's not ALL jobs, just one specific job.&lt;/P&gt;&lt;P&gt;Any tips on what I should be looking for?&lt;/P&gt;</description>
      <pubDate>Sat, 26 Apr 2025 19:49:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/116658#M45362</guid>
      <dc:creator>alonisser</dc:creator>
      <dc:date>2025-04-26T19:49:13Z</dc:date>
    </item>
    <item>
      <title>Re: Very long vacuum on s3</title>
      <link>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/117092#M45424</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25768"&gt;@alonisser&lt;/a&gt;, on Azure and GCP, VACUUM performs file deletion in parallel on the driver&amp;nbsp;when using Databricks Runtime 10.4 LTS or above. The more driver cores there are, the more the operation can be parallelised. On AWS, however, deletion is single-threaded: VACUUM uses a bulk delete API that removes files in batches of 1000, but it doesn’t use parallel threads. As a result, a multi-core driver may not help on AWS.&lt;/P&gt;
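&lt;P&gt;To illustrate the batching described above, here is a minimal Python sketch of how a list of file keys splits into bulk-delete requests of at most 1000 keys, handled one after another. The helper name batch_keys and the sample file names are hypothetical, and this is not the actual VACUUM implementation; it only shows why the number of sequential requests grows with the file count, regardless of how many driver cores are available.&lt;/P&gt;

```python
from typing import Iterator, List


def batch_keys(keys: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size keys (hypothetical helper)."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


# Hypothetical example: 2500 stale files awaiting deletion.
stale_files = [f"part-{i:05d}.parquet" for i in range(2500)]

# One bulk-delete request per batch, issued sequentially (single-threaded).
batches = list(batch_keys(stale_files))
batch_sizes = [len(batch) for batch in batches]
print(len(batches), batch_sizes)
```

&lt;P&gt;With 2500 files this gives three sequential requests; a table with millions of stale files would need thousands of them, which is one place to look when a single job vacuums slowly.&lt;/P&gt;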
&lt;P&gt;For best practices on VACUUM, please refer to&amp;nbsp;&lt;A href="https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake" target="_blank"&gt;https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 07:28:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/117092#M45424</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2025-04-30T07:28:20Z</dc:date>
    </item>
    <item>
      <title>Re: Very long vacuum on s3</title>
      <link>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/117291#M45467</link>
      <description>&lt;P data-renderer-start-pos="10578"&gt;For faster VACUUM runs:&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(1) avoid over-partitioned directories&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(2) avoid concurrent runs while the VACUUM command is running&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(3) avoid enabling S3 versioning (&lt;SPAN&gt;Delta Lake itself maintains the history&lt;/SPAN&gt;)&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(4) run the OPTIMIZE command periodically&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(5) enable autoCompaction/autoOptimize on the Delta table&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;(6) use the latest or a higher DBR with an auto-scaling cluster (for faster listing) and compute-optimized instance types.&lt;/P&gt;
&lt;P data-renderer-start-pos="10578"&gt;Also, the default&amp;nbsp;&lt;STRONG&gt;checkpointInterval&lt;/STRONG&gt; is currently 100, but on a lower DBR it would be 10. In that case you can set this property to 100 so that&lt;SPAN&gt;&amp;nbsp;checkpoint files are created every 100 commits.&lt;/SPAN&gt;&lt;/P&gt;
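&lt;P data-renderer-start-pos="10578"&gt;The table-level suggestions above (periodic OPTIMIZE, auto compaction, and the checkpoint interval) could be applied with statements along these lines. This is only a sketch: the table name my_catalog.my_schema.events and the helper maintenance_statements are hypothetical, and the property names delta.autoOptimize.autoCompact and delta.checkpointInterval should be confirmed against the documentation for your DBR version before you run anything.&lt;/P&gt;

```python
from typing import List


def maintenance_statements(table: str, checkpoint_interval: int = 100) -> List[str]:
    """Build SQL statements for the tips above (hypothetical helper)."""
    return [
        # Tip (4): compact small files into larger ones.
        f"OPTIMIZE {table}",
        # Tip (5): enable auto compaction on the table.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')",
        # Write a checkpoint file every checkpoint_interval commits.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.checkpointInterval' = '{checkpoint_interval}')",
    ]


# Hypothetical table name; in a Databricks notebook each statement
# could be passed to spark.sql(...).
statements = maintenance_statements("my_catalog.my_schema.events")
for statement in statements:
    print(statement)
```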
&lt;P data-renderer-start-pos="10578"&gt;&lt;SPAN&gt;- Since VACUUM is compute-intensive, use compute-optimized instance types such as C5 series instances (on AWS).&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 05:02:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/very-long-vacuum-on-s3/m-p/117291#M45467</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T05:02:54Z</dc:date>
    </item>
  </channel>
</rss>

