Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Does Databricks respect the parallel vacuum setting?

jpassaro
New Contributor

I am trying to run VACUUM on a Delta table that I know has millions of obsolete files.

Out of the box, VACUUM runs the deletes in sequence on the driver. That is bad news for me!

According to the OSS Delta docs, the setting spark.databricks.delta.vacuum.parallelDelete.enabled overrides that behavior and distributes the delete operations to the worker nodes. (Apparently you have to opt in because this risks hitting DeleteObject rate limits, but I'm doing it anyway.)

However, there's some ambiguity: the Databricks VACUUM doc doesn't mention this setting. It furthermore asserts that the deletes happen on the driver and that the best way to speed them up is to increase the driver size. I doubt this will get the job done for me.

Furthermore, I'm seeing behavior in my VACUUM operation that suggests it isn't being parallelized as I intended. I ran `SET spark.databricks.delta.vacuum.parallelDelete.enabled = true; VACUUM <my table> USING INVENTORY (<query on an AWS S3 inventory>);` on a cluster with autoscaling. The cluster scaled up at first, presumably while reading the inventory. Now it has been running for over 30 minutes with only 2 workers and no active jobs on it. I have a strong suspicion it is running the deletes on the driver, despite the parallelDelete setting.

Does Databricks actually respect the OSS parallelDelete setting?

 

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @jpassaro,

Thanks for laying out the context and the links. Let me clarify what’s actually happening here and how I’d recommend moving forward.

Short answer

No. On Databricks Runtime, the spark.databricks.delta.vacuum.parallelDelete.enabled setting from Delta OSS is not used. VACUUM deletions run on the driver only. What you’re seeing—workers going idle during the delete phase—is expected behavior on DBR and aligns with both the public docs and internal guidance.

Why this looks odd at first glance

Databricks VACUUM has two distinct phases:

First, file listing happens in parallel across workers.

Second, the delete phase runs entirely on the driver.

That means scaling executors helps only with listing. If deletes are slow, the lever that matters is the driver (cores and memory), not the worker count.

The confusion usually comes from the Delta OSS documentation. OSS Delta does describe a session-level config that enables parallel deletes, but that code path simply isn’t used in DBR. Setting spark.databricks.delta.vacuum.parallelDelete.enabled won’t fan deletes out to executors on Databricks.

USING INVENTORY is another place this shows up. It can significantly speed up listing by avoiding recursive storage scans, but it does not change delete semantics. Even with inventory-based listing, deletes are still issued by the driver.

Databricks docs vs. Delta OSS docs

Databricks documentation is explicit: file deletion is a driver-only operation and workers will sit idle during that phase. The guidance is to use a small worker pool (often 1–4) and size the driver appropriately when deletes are large or slow.

Delta OSS documentation says you can enable parallel deletes with a Spark config. That guidance applies to OSS deployments, not DBR.

Engineering has confirmed internally that the config is OSS-only and applies to a code path that isn’t exercised in DBR. DBR already uses its own batching and parallel request strategy on the driver, constrained by the underlying cloud object store. There’s no supported way to distribute deletes to executors.

How to speed up very large VACUUM runs on Databricks

If deletes are the bottleneck, focus on the driver:

Increase driver size (cores and memory) to avoid CPU saturation or GC pressure during deletes. The docs suggest 8–32 cores as a starting point; for very large file counts, you may need more.

If you’re on DBR 16.1+ and have had a successful full VACUUM within log retention, use VACUUM LITE. LITE skips full storage listing and deletes only files referenced in the transaction log, which can dramatically reduce runtime.

If soft deletes have caused a buildup of obsolete files, run REORG TABLE … APPLY (PURGE) first, then VACUUM (after retention allows). This often reduces the volume of files that VACUUM has to process.
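A minimal sketch of that sequence, using the same <table> placeholder as the checklist below (retention still has to elapse between the two statements):

-- Rewrite files so that soft-deleted data is physically purged
REORG TABLE <table> APPLY (PURGE);

-- Later, once the files removed by REORG age past the retention window:
VACUUM <table>;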

Expect to see “no Spark tasks” during the delete phase. That’s normal. The driver is issuing storage delete calls directly. If you want visibility, enable VACUUM audit logging via spark.databricks.delta.vacuum.logging.enabled or check driver logs for FS_OP_DELETE entries to confirm progress.
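For example, a rough sketch of how you could check progress and scope for a session (DRY RUN only previews what a real run would remove):

-- Enable VACUUM audit logging for this session
SET spark.databricks.delta.vacuum.logging.enabled = true;

-- Preview a sample of the files eligible for deletion, without deleting anything
VACUUM <table> DRY RUN;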

A note on USING INVENTORY

Delta OSS supports VACUUM … USING INVENTORY to accelerate the listing phase using a manifest. On DBR, this syntax isn’t documented. Even if listing is accelerated, deletes remain driver-only per Databricks design.
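For completeness, the OSS-style syntax looks roughly like this. Treat it as a sketch only: the columns shown are roughly what OSS Delta expects from the inventory (check the OSS docs for the exact schema), and either way deletes still run on the driver in DBR.

-- OSS-style inventory-based VACUUM (sketch; not documented on DBR)
VACUUM <table> USING INVENTORY (
  SELECT path, length, isDir, modificationTime
  FROM <view over your AWS S3 inventory export>
);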

A practical checklist you can use

-- If compliance allows and shorter retention is acceptable:
ALTER TABLE <table>
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 24 hours');

-- If you’ve run a successful full VACUUM recently:
VACUUM <table> LITE;

-- Otherwise, estimate scope and then run a full VACUUM:
VACUUM <table> DRY RUN;
VACUUM <table> FULL;

Key takeaways

Scale the driver for delete-heavy VACUUMs; workers help only with listing.

Run VACUUM off-peak to avoid cloud delete throttling.

Executor-level parallel deletes aren’t a tuning option on DBR—the platform already handles batching and parallelism on the driver.

 

Hope this helps, Louis.