Slow performance of VACUUM on Azure Data Lake Store Gen2

gbrueckl
Contributor II

We need to run VACUUM on one of our biggest tables to free up storage. According to our analysis using

VACUUM bigtable DRY RUN

this affects 30M+ files that need to be deleted.

If we run the final VACUUM, the file listing takes up to 2h (which is OK), but the actual deletion is super slow at about 40 files/second.

We already tried different cluster configurations and Spark settings, but none had any significant impact.

The bottleneck seems to be the single-threaded deletion of the physical files, which runs on the driver only and deletes one file at a time.

Any ideas how this can be sped up?

10 REPLIES

Kaniz
Community Manager

Hi @gbrueckl! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will follow up shortly with a response.

So, any update from you or your colleagues?

Kaniz
Community Manager

Hi @gbrueckl,

I've relayed this to my team.

They'll get back to you soon.

Thank you for your patience 😊

Anonymous
Not applicable

@Gerhard Brueckl​ - I'm sorry, we still don't have an update for you. I'm trying again.

User16752246494
Contributor

Hi @Gerhard Brueckl​ ,

VACUUM bigtable DRY RUN

will just list the files for deletion but does not actually delete the physical files. As we see, there are around 2M files listed. Once the files have actually been vacuumed, the dry run will no longer take this much time.

Hi @Sunando Bhattacharya, and thanks for your reply.

Please re-read my question. The problem is not the time it takes to list the files (regardless of DRY RUN or not) but the deletion of 30M files at a rate of 40 files/second.

Deepak_Bhutada
Contributor III

Hi @gbrueckl,

Could you please try enabling the Spark config below on the cluster, restart it, and run VACUUM?

spark.databricks.delta.vacuum.parallelDelete.enabled true
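
For reference, a minimal sketch of trying the same setting at session level from a Scala notebook before re-running the command, assuming the bigtable name from the original post and that the setting is picked up at session scope on your runtime:

// Sketch only: enable parallel deletes for this session, then re-run the vacuum.
// Whether the session-level setting takes effect can depend on the Databricks Runtime version.
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")
spark.sql("VACUUM bigtable")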

Sure, I already tried that, but it does not help at all - probably related to ADLS Gen2.

Deepak_Bhutada
Contributor III

@Gerhard Brueckl​ 

We have seen around 80k-120k file deletions per hour in Azure while doing a VACUUM on Delta tables; the vacuum is simply slower on Azure and S3. As you said, it might take some time to delete the files from the Delta path.

To minimize the DBU cost of the vacuum, you can use autoscaling with 1-2 workers on the cheapest instances.

The reason we recommend autoscaling with a minimum of 1 worker is that the first step of the vacuum, where we read the Delta logs and identify the files to be deleted, can be very slow for large tables if there is only one node. So use the cluster resources in step 1, and then in step 2, where the deletion starts from the driver, scale down the executor resources.

How to estimate the number of files deleted per hour:

You can get a high-level estimate of how many files are getting deleted in an hour by checking for the FS_OP_DELETE emitter in the driver logs.

Another way is to run a dry-run command after an hour and see what count it shows.

The second approach won't give the exact deletion rate, because new files could be identified for vacuuming when you run the command after an hour, but it is good enough for an estimate.
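
For illustration, a rough Scala sketch of the dry-run comparison, assuming the DRY RUN result returns one row per candidate file (the output format can vary across Databricks Runtime versions, so treat the difference only as an estimate):

// Count the vacuum candidates now, let the real VACUUM run for about an hour, then count again;
// the difference approximates the hourly deletion rate.
val before = spark.sql("VACUUM bigtable DRY RUN").count()
// ... roughly one hour later ...
val after = spark.sql("VACUUM bigtable DRY RUN").count()
println(s"Approximate files deleted in the last hour: ${before - after}")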

The 80k-120k file deletions per hour are about the same as the 40 files/second that we observe - it runs single-threaded on the driver only, and you can easily see this in the driver logs.

As I need to delete 30M+ files, it takes about 300 hours -> 12.5 days.

We are already using a single-node cluster, as the time we save during the file listing is negligible compared to the cost, and we also observed that the cluster does not scale back down during the delete operation.

I got the estimated file count by running a VACUUM DRY RUN via Scala, which prints the number of files.
