09-10-2021 02:36 AM
We need to run VACUUM on one of our biggest tables to free up storage. According to our analysis using
VACUUM bigtable DRY RUN
this affects 30M+ files that need to be deleted.
If we run the final VACUUM, the file listing takes up to 2h (which is OK), but the actual deletion is super slow at about 40 files/second.
We have already tried different cluster configurations and Spark settings, but none had any significant impact.
The bottleneck seems to be the deletion of the physical files, which runs single-threaded on the driver only, deleting one file at a time.
Any ideas how this can be sped up?
09-16-2021 12:35 PM
so, any update from you or your colleagues?
09-23-2021 06:08 AM
@Gerhard Brueckl - I'm sorry, we still don't have an update for you. I'm trying again.
09-27-2021 10:40 AM
Hi @Gerhard Brueckl ,
VACUUM bigtable DRY RUN
will only list the files eligible for deletion without actually deleting the physical files. As we can see, there are around 2M files listed. Once the files have actually been vacuumed, the dry run will no longer take this much time.
09-27-2021 12:10 PM
Hi @Sunando Bhattacharya and thanks for your reply.
Please re-read my question. The problem is not the time it takes to list the files (regardless of DRY RUN or not) but the deletion of 30M files at a rate of only 40 files/second.
10-20-2021 04:15 AM
Hi gbrueckl (Customer),
Could you please try enabling the below Spark config on the cluster, restart it, and run VACUUM again?
spark.databricks.delta.vacuum.parallelDelete.enabled true
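For reference, the flag can also be set per-session in a notebook before running the command (a sketch, assuming a Databricks notebook where `spark` is the active session and `bigtable` is the table from the original post):

```scala
// Ask VACUUM to delete the identified files in parallel instead of
// one at a time from the driver (Databricks Delta setting)
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")

// Review the affected files first, then run the actual vacuum
spark.sql("VACUUM bigtable DRY RUN")
spark.sql("VACUUM bigtable")
```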
10-20-2021 04:19 AM
Sure, I already tried that, but it has no effect at all; this is probably related to ADLS Gen2.
10-20-2021 05:28 AM
@Gerhard Brueckl
We have seen around 80k-120k file deletions per hour in Azure while doing a VACUUM on Delta tables; VACUUM is simply slower on Azure and S3. As you said, it might take some time to delete the files from the Delta path.
To minimize the DBU cost of running a VACUUM, you can use autoscaling with 0-2 workers on the cheapest instances.
The reason we recommend autoscaling with a minimum of one worker is that the first step of VACUUM, where we read the Delta logs and identify the files to be deleted, can be very slow for large tables if there is only one node. To avoid this, use the cluster resources fully in step 1; then, in step 2, where the deletion starts from the driver, scale the executor resources back down.
How to get estimated # of files deleted in an hour:
You can get a high-level estimate of how many files are being deleted per hour by checking for FS_OP_DELETE entries in the driver logs.
Another way is to run a dry-run command after an hour and see what count it shows.
The second approach won't give the exact deletion rate, because new files may become eligible for vacuuming while you wait, but it is good enough for an estimate.
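The two-snapshot approach above boils down to a quick back-of-the-envelope calculation (a sketch; the counts below are purely illustrative placeholders, not values from this thread):

```scala
// Hypothetical file counts from two VACUUM ... DRY RUN snapshots taken one hour apart
val filesAtT0 = 30000000L   // files listed at the start
val filesAtT1 = 29880000L   // files listed one hour later

val deletedPerHour = filesAtT0 - filesAtT1       // new files may mask a few deletions
val filesPerSecond = deletedPerHour / 3600.0

println(f"about $deletedPerHour files/hour, i.e. roughly $filesPerSecond%.1f files/second")
```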
10-20-2021 06:25 AM
The 80k-120k file deletions per hour are roughly in line with the ~40 files/second that we observe. It runs single-threaded on the driver only, and you can easily see this in the driver logs.
As I need to delete 30M+ files, this takes about 300 hours -> 12.5 days.
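As a sanity check on that estimate (a sketch using the round 30M figure at the observed 40 files/second; the actual count is above 30M, which pushes the total toward the 300 hours mentioned above):

```scala
// Rough runtime for a single-threaded delete of N files at a fixed rate
val totalFiles = 30000000L     // files reported by the dry run (lower bound)
val filesPerSecond = 40.0      // observed driver-side deletion rate

val hours = totalFiles / filesPerSecond / 3600.0
val days = hours / 24.0

println(f"$hours%.0f hours, about $days%.1f days")
```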
We are already using a single-node cluster, as the time saved during the file listing is negligible compared to the cost; we also observed that the cluster does not scale back down during the delete operation.
I got the estimated file count by running VACUUM DRY RUN via Scala, which prints the number of files.