topic Re: Very long vacuum on s3 in Data Engineering

Very long vacuum on s3

alonisser — Sat, 26 Apr 2025 19:49:13 GMT

Since we moved from Azure to AWS, a specific job has had extremely long VACUUM runs.

Is there a specific flag/configuration for S3 storage that is needed to support faster VACUUM?

How can I research what's going on?

Note: it's not ALL jobs, just one specific job.

Any tips on what I should be looking for?

Re: Very long vacuum on s3

iyashk-DB — Wed, 30 Apr 2025 07:28:20 GMT

Hey @alonisser, on Azure and GCP, VACUUM performs the deletion in parallel on the driver when using Databricks Runtime 10.4 LTS or above. The more driver cores there are, the more the operation can be parallelised. On AWS, however, deletes happen in batches and the process is single-threaded: AWS uses a bulk delete API and deletes in batches of 1000, but it doesn’t use parallel threads. As a result, using a multi-core driver may not help on AWS.
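
To get a feel for why single-threaded batching matters on large tables, here is a minimal Python sketch of the arithmetic. It does not call S3 or Databricks; the helper names and the 0.5 s per-request latency are assumptions for illustration only, and only the batch size of 1000 comes from the post above.

```python
from math import ceil
from typing import Iterator, List, Sequence

BATCH_SIZE = 1000  # batch size quoted in the post above


def batches(keys: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keys (hypothetical helper)."""
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


def sequential_delete_estimate(num_files: int,
                               seconds_per_request: float,
                               size: int = BATCH_SIZE) -> float:
    """Estimated wall-clock seconds when batches are issued one after another."""
    return ceil(num_files / size) * seconds_per_request


# 2,500 stale files are split into three sequential requests.
keys = [f"part-{i}.parquet" for i in range(2500)]
print([len(b) for b in batches(keys)])  # [1000, 1000, 500]

# 5 million stale files at an assumed 0.5 s per bulk request.
print(sequential_delete_estimate(5_000_000, 0.5))  # 2500.0
```

Because the requests run one after another, the estimate grows linearly with the number of files to delete, which is why reducing the file count tends to matter more than adding driver cores in this setting.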

For best practices on VACUUM, please refer to: https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake

Re: Very long vacuum on s3

NandiniN — Thu, 01 May 2025 05:02:54 GMT

For faster VACUUM runs:

(1) avoid over-partitioned directories,

(2) avoid concurrent runs (while the VACUUM command is running),

(3) avoid enabling S3 versioning (Delta Lake itself maintains the history),

(4) run the OPTIMIZE command periodically,

(5) enable autoCompaction/autoOptimize on the Delta table,

(6) use the latest (or a higher) DBR with an auto-scaling cluster (for faster listing) and compute-optimized instance types.
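
Points (4) and (5) above could be sketched roughly as follows. This is only an illustration: the table name is a placeholder, the helper is hypothetical, and the property names `delta.autoOptimize.autoCompact` and `delta.autoOptimize.optimizeWrite` are my assumption of what "autoCompaction/autoOptimize" refers to, so please check them against the docs for your DBR version.

```python
from typing import List


def maintenance_statements(table: str) -> List[str]:
    """Build the SQL for points (4) and (5); hypothetical helper, nothing is executed."""
    return [
        # (4) compact small files so VACUUM has fewer files to list and delete
        f"OPTIMIZE {table}",
        # (5) assumed property names for auto compaction / optimized writes
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.autoOptimize.autoCompact' = 'true', "
        "'delta.autoOptimize.optimizeWrite' = 'true')",
    ]


for stmt in maintenance_statements("my_schema.my_table"):  # placeholder table name
    print(stmt)
    # In a Databricks notebook, where `spark` is predefined: spark.sql(stmt)
```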

Also, the default checkpointInterval is currently 100, but on a lower DBR it would be 10. In that case you can set this property to 100 so that checkpoint files are created every 100 commits.
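
A minimal sketch of changing the checkpoint interval, assuming the table property is named `delta.checkpointInterval`. The helper and the table name are placeholders, not something from this thread.

```python
def checkpoint_interval_statement(table: str, commits: int = 100) -> str:
    """Build the ALTER TABLE statement for the checkpoint interval (hypothetical helper)."""
    if commits < 1:
        raise ValueError("commits must be a positive integer")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('delta.checkpointInterval' = '{commits}')"
    )


stmt = checkpoint_interval_statement("my_schema.my_table")  # placeholder table name
print(stmt)
# In a Databricks notebook, where `spark` is predefined: spark.sql(stmt)
```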

- Since VACUUM is compute-intensive, use compute-optimized instance types such as C5 series instances (on AWS).