09-10-2021 02:36 AM
We need to run VACUUM on one of our biggest tables to free up storage. According to our analysis using
VACUUM bigtable DRY RUN
this affects 30M+ files that need to be deleted.
If we run the final VACUUM, the file listing takes up to 2h (which is OK), but the actual deletion is super slow at about 40 files/second.
We have already tried different cluster configurations and Spark settings, but none had any significant impact.
The bottleneck seems to be the deletion of the physical files, which runs single-threaded on the driver only, deleting one file at a time.
Any ideas how this can be sped up?
09-16-2021 12:35 PM
so, any update from you or your colleagues?
09-23-2021 06:08 AM
@Gerhard Brueckl - I'm sorry, we still don't have an update for you. I'm trying again.
09-27-2021 10:40 AM
Hi @Gerhard Brueckl ,
VACUUM bigtable DRY RUN
will only list the files eligible for deletion without actually deleting the physical files. As we can see, there are around 2M files listed. Once the files have actually been vacuumed, the dry run will no longer take this much time.
09-27-2021 12:10 PM
Hi @Sunando Bhattacharya and thanks for your reply.
Please re-read my question. The problem is not the time it takes to list the files (regardless of DRY RUN or not) but the deletion of 30M files at a rate of only 40 files/second.
10-20-2021 04:15 AM
Hi gbrueckl (Customer),
Could you please try enabling the below Spark config on the cluster, restart it, and run VACUUM again?
spark.databricks.delta.vacuum.parallelDelete.enabled true
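For reference, the flag can also be set per-session in a notebook before running the command (a sketch, assuming a Databricks notebook where `spark` is the active session and `bigtable` is the table from the original post):

```scala
// Ask VACUUM to delete the identified files in parallel instead of
// one at a time from the driver (Databricks Delta setting)
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")

// Review the affected files first, then run the actual vacuum
spark.sql("VACUUM bigtable DRY RUN")
spark.sql("VACUUM bigtable")
```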
10-20-2021 04:19 AM
Sure, I already tried that, but it has no effect at all; this is probably related to ADLS Gen2.
10-20-2021 05:28 AM
@Gerhard Brueckl
We have seen around 80k-120k file deletions per hour in Azure while doing a VACUUM on Delta tables; VACUUM is simply slower on Azure and S3. As you said, it might take some time to delete the files from the Delta path.
To minimize the DBU cost of running a VACUUM, you can use autoscaling with 0-2 workers on the cheapest instances.
The reason we recommend autoscaling with a minimum of one worker is that the first step of VACUUM, where we read the Delta logs and identify the files to be deleted, can be very slow for large tables if there is only one node. To avoid this, use the cluster resources fully in step 1; then, in step 2, where the deletion starts from the driver, scale the executor resources back down.
How to get estimated # of files deleted in an hour:
You can get a high-level estimate of how many files are being deleted per hour by checking for FS_OP_DELETE entries in the driver logs.
Another way is to run a dry-run command after an hour and see what count it shows.
The second approach won't give the exact deletion rate, because new files may become eligible for vacuuming while you wait, but it is good enough for an estimate.
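The two-snapshot approach above boils down to a quick back-of-the-envelope calculation (a sketch; the counts below are purely illustrative placeholders, not values from this thread):

```scala
// Hypothetical file counts from two VACUUM ... DRY RUN snapshots taken one hour apart
val filesAtT0 = 30000000L   // files listed at the start
val filesAtT1 = 29880000L   // files listed one hour later

val deletedPerHour = filesAtT0 - filesAtT1       // new files may mask a few deletions
val filesPerSecond = deletedPerHour / 3600.0

println(f"about $deletedPerHour files/hour, i.e. roughly $filesPerSecond%.1f files/second")
```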
10-20-2021 06:25 AM
The 80k-120k file deletions per hour are roughly in line with the ~40 files/second that we observe. It runs single-threaded on the driver only, and you can easily see this in the driver logs.
As I need to delete 30M+ files, this takes about 300 hours -> 12.5 days.
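As a sanity check on that estimate (a sketch using the round 30M figure at the observed 40 files/second; the actual count is above 30M, which pushes the total toward the 300 hours mentioned above):

```scala
// Rough runtime for a single-threaded delete of N files at a fixed rate
val totalFiles = 30000000L     // files reported by the dry run (lower bound)
val filesPerSecond = 40.0      // observed driver-side deletion rate

val hours = totalFiles / filesPerSecond / 3600.0
val days = hours / 24.0

println(f"$hours%.0f hours, about $days%.1f days")
```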
We are already using a single-node cluster, as the time saved during the file listing is negligible compared to the cost; we also observed that the cluster does not scale back down during the delete operation.
I got the estimated file count by running VACUUM DRY RUN via Scala, which prints the number of files.