
Very long VACUUM on S3

alonisser
Contributor II

Since we've moved from Azure to AWS, a specific job has extremely long VACUUM runs.

Is there a specific flag/configuration for the S3 storage that is needed to support faster VACUUM?

How can I research what's going on?

Note: it's not ALL jobs, just one specific job.

Any tips on what I should be looking for?

2 REPLIES

iyashk-DB
Databricks Employee

Hey @alonisser, on Azure and GCP, VACUUM deletion is performed in parallel on the driver when using Databricks Runtime 10.4 LTS or above, so the higher the number of driver cores, the more the operation can be parallelized. On AWS, however, deletes happen in batches and the process is single-threaded: AWS exposes a bulk delete API that removes objects in batches of 1,000, but it does not use parallel threads. As a result, using a multi-core driver may not help on AWS.
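To research what a slow VACUUM is actually doing, a DRY RUN lists the files that would be deleted without removing anything, which gives a sense of how many objects the single-threaded delete loop has to work through. A minimal sketch for a Databricks notebook, assuming `spark` is the ambient SparkSession and using the hypothetical table name `my_catalog.my_schema.events`:

```python
# Sketch: estimate the VACUUM workload before running it for real.
# `my_catalog.my_schema.events` is a hypothetical name -- substitute your table.

# DRY RUN returns (up to 1000 of) the files that would be deleted,
# without actually deleting anything.
to_delete = spark.sql(
    "VACUUM my_catalog.my_schema.events RETAIN 168 HOURS DRY RUN"
)
to_delete.show(truncate=False)

# A very large candidate list, or paths spread across many partition
# directories, suggests the bottleneck is S3 listing/deletion itself
# rather than the cluster size.
```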

For best practices on VACUUM, please refer to https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake

NandiniN
Databricks Employee

For faster VACUUM run performance,

(1) avoid over-partitioned directories

(2) avoid concurrent runs on the table while the VACUUM command is running

(3) avoid enabling S3 versioning (Delta Lake itself maintains the history)

(4) run the OPTIMIZE command periodically

(5) enable auto compaction / optimized writes on the Delta table (see the sketch after this list)

(6) use the latest/higher DBR with an auto-scaling cluster (for faster listing) and compute-optimized instance types
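A minimal sketch of (4) and (5) for a Databricks notebook, again assuming `spark` is the ambient SparkSession and the hypothetical table name `my_catalog.my_schema.events`; `delta.autoOptimize.autoCompact` and `delta.autoOptimize.optimizeWrite` are the Delta table properties behind auto compaction and optimized writes:

```python
# Sketch: compact small files so VACUUM has fewer objects to scan and delete.
# `my_catalog.my_schema.events` is a hypothetical name -- substitute your table.

# (4) Periodic compaction: rewrites many small files into fewer, larger ones.
spark.sql("OPTIMIZE my_catalog.my_schema.events")

# (5) Auto compaction / optimized writes keep file counts down on future writes.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.events SET TBLPROPERTIES (
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true'
    )
""")
```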

Also, the default checkpointInterval is currently 100, but on a lower DBR it would be 10; you can alter this property to 100 so that checkpoint files are created every 100 commits.
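On an existing table this corresponds to the delta.checkpointInterval table property; a short sketch (with the same hypothetical table name as above):

```python
# Sketch: write a Delta checkpoint every 100 commits instead of every 10.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.events
    SET TBLPROPERTIES ('delta.checkpointInterval' = '100')
""")
```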

- Since VACUUM is compute intensive, use compute-optimized instance types such as C5-series instances (for AWS).