<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic does databricks respect parallel vacuum setting? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/does-databricks-respect-parallel-vacuum-setting/m-p/142201#M51900</link>
    <description>&lt;P&gt;I am trying to run VACUUM on a Delta table that I know has millions of obsolete files.&lt;BR /&gt;&lt;BR /&gt;Out of the box, VACUUM runs the deletes in sequence on the driver. That is bad news for me!&lt;BR /&gt;&lt;BR /&gt;According to the &lt;A href="https://docs.delta.io/delta-utility/#remove-files-no-longer-referenced-by-a-delta-table" target="_self"&gt;OSS Delta docs&lt;/A&gt;, the setting &lt;FONT face="courier new,courier"&gt;spark.databricks.delta.vacuum.parallelDelete.enabled&lt;/FONT&gt; overrides that behavior and distributes the delete operations to the worker nodes. (Apparently you have to opt in because this risks hitting DeleteObject rate limits -- but I'm doing it anyway.)&lt;BR /&gt;&lt;BR /&gt;However, there's some ambiguity! The &lt;A href="https://docs.databricks.com/aws/en/delta/vacuum#what-size-cluster-does-vacuum-need" target="_blank" rel="noopener"&gt;Databricks VACUUM doc&lt;/A&gt; doesn't mention this setting. It also asserts that the deletes happen on the driver, and that the best way to speed them up is to increase the driver size. I doubt this will get the job done for me.&lt;/P&gt;&lt;P&gt;Furthermore, I'm seeing behavior on my VACUUM operation that suggests it isn't being parallelized as I intended. I ran &lt;FONT face="courier new,courier"&gt;SET spark.databricks.delta.vacuum.parallelDelete.enabled = true; VACUUM &amp;lt;my table&amp;gt; using inventory (&amp;lt;query on an AWS S3 inventory&amp;gt;);&lt;/FONT&gt; on a cluster with autoscaling. The cluster scaled up at first, presumably while reading the inventory. Now it has been running for over 30 minutes with only 2 workers and no active jobs on it. I have a strong suspicion that it is running the deletes in parallel on the driver, despite the parallelDelete setting.&lt;/P&gt;&lt;P&gt;Does Databricks actually respect the OSS parallelDelete setting?&lt;/P&gt;</description>
    <pubDate>Thu, 18 Dec 2025 21:47:56 GMT</pubDate>
    <dc:creator>jpassaro</dc:creator>
    <dc:date>2025-12-18T21:47:56Z</dc:date>
    <item>
      <title>does databricks respect parallel vacuum setting?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-respect-parallel-vacuum-setting/m-p/142201#M51900</link>
      <description>&lt;P&gt;I am trying to run VACUUM on a Delta table that I know has millions of obsolete files.&lt;BR /&gt;&lt;BR /&gt;Out of the box, VACUUM runs the deletes in sequence on the driver. That is bad news for me!&lt;BR /&gt;&lt;BR /&gt;According to the &lt;A href="https://docs.delta.io/delta-utility/#remove-files-no-longer-referenced-by-a-delta-table" target="_self"&gt;OSS Delta docs&lt;/A&gt;, the setting &lt;FONT face="courier new,courier"&gt;spark.databricks.delta.vacuum.parallelDelete.enabled&lt;/FONT&gt; overrides that behavior and distributes the delete operations to the worker nodes. (Apparently you have to opt in because this risks hitting DeleteObject rate limits -- but I'm doing it anyway.)&lt;BR /&gt;&lt;BR /&gt;However, there's some ambiguity! The &lt;A href="https://docs.databricks.com/aws/en/delta/vacuum#what-size-cluster-does-vacuum-need" target="_blank" rel="noopener"&gt;Databricks VACUUM doc&lt;/A&gt; doesn't mention this setting. It also asserts that the deletes happen on the driver, and that the best way to speed them up is to increase the driver size. I doubt this will get the job done for me.&lt;/P&gt;&lt;P&gt;Furthermore, I'm seeing behavior on my VACUUM operation that suggests it isn't being parallelized as I intended. I ran &lt;FONT face="courier new,courier"&gt;SET spark.databricks.delta.vacuum.parallelDelete.enabled = true; VACUUM &amp;lt;my table&amp;gt; using inventory (&amp;lt;query on an AWS S3 inventory&amp;gt;);&lt;/FONT&gt; on a cluster with autoscaling. The cluster scaled up at first, presumably while reading the inventory. Now it has been running for over 30 minutes with only 2 workers and no active jobs on it. I have a strong suspicion that it is running the deletes in parallel on the driver, despite the parallelDelete setting.&lt;/P&gt;&lt;P&gt;Does Databricks actually respect the OSS parallelDelete setting?&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2025 21:47:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-respect-parallel-vacuum-setting/m-p/142201#M51900</guid>
      <dc:creator>jpassaro</dc:creator>
      <dc:date>2025-12-18T21:47:56Z</dc:date>
    </item>
    <item>
      <title>Re: does databricks respect parallel vacuum setting?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-databricks-respect-parallel-vacuum-setting/m-p/142262#M51908</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/201409"&gt;@jpassaro&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;Thanks for laying out the context and the links. Let me clarify what’s actually happening here and how I’d recommend moving forward.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Short answer&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;No. On Databricks Runtime, the spark.databricks.delta.vacuum.parallelDelete.enabled setting from Delta OSS is not used. VACUUM deletions run on the driver only. What you’re seeing—workers going idle during the delete phase—is expected behavior on DBR and aligns with both the public docs and internal guidance.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Why this looks odd at first glance&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Databricks VACUUM has two distinct phases:&lt;/P&gt;
&lt;P class="p1"&gt;First, file listing happens in parallel across workers.&lt;/P&gt;
&lt;P class="p1"&gt;Second, the delete phase runs entirely on the driver.&lt;/P&gt;
&lt;P class="p1"&gt;That means scaling executors helps only with listing. If deletes are slow, the lever that matters is the driver (cores and memory), not the worker count.&lt;/P&gt;
&lt;P class="p1"&gt;The confusion usually comes from the Delta OSS documentation. OSS Delta does describe a session-level config that enables parallel deletes, but that code path simply isn’t used in DBR. Setting spark.databricks.delta.vacuum.parallelDelete.enabled won’t fan deletes out to executors on Databricks.&lt;/P&gt;
&lt;P class="p1"&gt;USING INVENTORY is another place this shows up. It can significantly speed up listing by avoiding recursive storage scans, but it does not change delete semantics. Even with inventory-based listing, deletes are still issued by the driver.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Databricks docs vs. Delta OSS docs&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Databricks documentation is explicit: file deletion is a driver-only operation and workers will sit idle during that phase. The guidance is to use a small worker pool (often 1–4) and size the driver appropriately when deletes are large or slow.&lt;/P&gt;
&lt;P class="p1"&gt;Delta OSS documentation says you can enable parallel deletes with a Spark config. That guidance applies to OSS deployments, not DBR.&lt;/P&gt;
&lt;P class="p1"&gt;Engineering has confirmed internally that the config is OSS-only and applies to a code path that isn’t exercised in DBR. DBR already uses its own batching and parallel request strategy on the driver, constrained by the underlying cloud object store. There’s no supported way to distribute deletes to executors.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;How to speed up very large VACUUM runs on Databricks&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;If deletes are the bottleneck, focus on the driver:&lt;/P&gt;
&lt;P class="p1"&gt;Increase driver size (cores and memory) to avoid CPU saturation or GC pressure during deletes. The docs suggest 8–32 cores as a starting point; for very large file counts, you may need more.&lt;/P&gt;
&lt;P class="p1"&gt;If you’re on DBR 16.1+ and have had a successful full VACUUM within log retention, use VACUUM LITE. LITE skips full storage listing and deletes only files referenced in the transaction log, which can dramatically reduce runtime.&lt;/P&gt;
&lt;P class="p1"&gt;If soft deletes have caused a buildup of obsolete files, run REORG TABLE … APPLY (PURGE) first, then VACUUM (after retention allows). This often reduces the volume of files that VACUUM has to process.&lt;/P&gt;
&lt;P class="p1"&gt;Expect to see “no Spark tasks” during the delete phase. That’s normal. The driver is issuing storage delete calls directly. If you want visibility, enable VACUUM audit logging via spark.databricks.delta.vacuum.logging.enabled or check driver logs for FS_OP_DELETE entries to confirm progress.&lt;/P&gt;
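&lt;P class="p1"&gt;As a rough sketch of that visibility check: this assumes the logging config behaves on your runtime as it does in Delta OSS, where it adds VACUUM START and VACUUM END entries to the table history. Treat that as an assumption to verify rather than a guarantee; &amp;lt;table&amp;gt; is a placeholder.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;-- Assumption: with this set, VACUUM records its start and end in the table history.
SET spark.databricks.delta.vacuum.logging.enabled = true;

VACUUM &amp;lt;table&amp;gt;;

-- From another session, look for VACUUM START / VACUUM END among recent operations.
DESCRIBE HISTORY &amp;lt;table&amp;gt; LIMIT 10;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;The VACUUM END entry should only appear once the run finishes, so for progress during a long delete phase the driver logs are likely the better signal.&lt;/P&gt;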
&lt;P class="p1"&gt;&lt;STRONG&gt;A note on USING INVENTORY&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Delta OSS supports VACUUM … USING INVENTORY to accelerate the listing phase using a manifest. On DBR, this syntax isn’t documented. Even if listing is accelerated, deletes remain driver-only per Databricks design.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;A practical checklist you can use&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;-- If compliance allows and shorter retention is acceptable:
ALTER TABLE &amp;lt;table&amp;gt; 
SET TBLPROPERTIES (delta.deletedFileRetentionDuration = '24 hours');

-- If you’ve run a successful full VACUUM recently:
VACUUM &amp;lt;table&amp;gt; LITE;

-- Otherwise, estimate scope and then run a full VACUUM:
VACUUM &amp;lt;table&amp;gt; DRY RUN;
VACUUM &amp;lt;table&amp;gt; FULL;&lt;/CODE&gt;&lt;/PRE&gt;
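&lt;P class="p1"&gt;The checklist does not cover the soft-delete case mentioned earlier. A minimal sketch of that sequence, assuming the obsolete data comes from soft deletes and with &amp;lt;table&amp;gt; as a placeholder:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;-- Rewrite data files that still carry soft-deleted rows.
REORG TABLE &amp;lt;table&amp;gt; APPLY (PURGE);

-- Later, once the retention window has passed for the files REORG replaced:
VACUUM &amp;lt;table&amp;gt;;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;REORG does not delete anything from storage itself; the replaced files only become eligible for VACUUM after the retention period.&lt;/P&gt;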
&lt;P class="p1"&gt;&lt;STRONG&gt;Key takeaways&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Scale the driver for delete-heavy VACUUMs; workers help only with listing.&lt;/P&gt;
&lt;P class="p1"&gt;Run VACUUM off-peak to avoid cloud delete throttling.&lt;/P&gt;
&lt;P class="p1"&gt;Executor-level parallel deletes aren’t a tuning option on DBR—the platform already handles batching and parallelism on the driver.&lt;/P&gt;
&lt;P class="p1"&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Dec 2025 13:31:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-databricks-respect-parallel-vacuum-setting/m-p/142262#M51908</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-12-19T13:31:04Z</dc:date>
    </item>
  </channel>
</rss>

