03-06-2022 05:09 PM
Hi, I'm running some scheduled VACUUM jobs and would like to know how many files were deleted without doing all the computation twice (once with DRY RUN and once without). Is there a way to accomplish this?
Thanks!
03-07-2022 02:44 AM
The following query gives the required information:
SELECT * FROM (DESCRIBE HISTORY table) x WHERE operation IN ('VACUUM START', 'VACUUM END');
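If vacuum logging is enabled (see the last reply in this thread), the VACUUM END history entry exposes operation metrics such as numDeletedFiles. A minimal sketch of pulling that field out, assuming a Delta table named my_table (a placeholder):

// Sketch: operationMetrics is a string map on each history row, so the
// deleted-file count can be read straight out of the VACUUM END entries.
val deleted = spark.sql("DESCRIBE HISTORY my_table")
  .where("operation = 'VACUUM END'")
  .selectExpr("version", "timestamp", "operationMetrics['numDeletedFiles'] AS numDeletedFiles")
deleted.show(truncate = false)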
03-06-2022 09:51 PM
Hi @Alejandro Martinez:
I don't think there is a command to get the statistics before and after a vacuum; at least I haven't come across one. If you want to capture more details, maybe you can write a function to capture the statistics, as sketched below.
Before vacuum, capture the data file count and total size:
val files = dbutils.fs.ls("<Your Table Path>")  // placeholder path
val getDataFileCount = files.size
var getDataFileSize = 0L                        // Long, so large totals don't overflow
files.foreach { file =>
  getDataFileSize += file.size
}
After vacuum: repeat the above and compare the two results.
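A minimal sketch of wrapping this into a reusable helper; the function name and path are hypothetical, and note that dbutils.fs.ls is not recursive, so a partitioned table would need its subdirectories walked as well:

// Hypothetical helper: count and total size of the files directly under a path.
def dataFileStats(path: String): (Long, Long) = {
  val files = dbutils.fs.ls(path)
  (files.size.toLong, files.map(_.size).sum)
}

val (countBefore, sizeBefore) = dataFileStats("<Your Table Path>")
// ... run VACUUM here ...
val (countAfter, sizeAfter) = dataFileStats("<Your Table Path>")
println(s"Files deleted: ${countBefore - countAfter}, bytes freed: ${sizeBefore - sizeAfter}")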
Let's see if other community members have better ideas on this.
03-07-2022 06:13 AM
Thank you! Not the solution I was looking for, but it seems nothing better exists yet, so I'm going with that.
Thanks!!!
03-07-2022 06:22 AM
Logging has to be enabled for VACUUM operations to be captured in the table history:
spark.conf.set("spark.databricks.delta.vacuum.logging.enabled", "true")
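Once that flag is set, each VACUUM writes VACUUM START and VACUUM END entries into the table history, so the DESCRIBE HISTORY query above picks up the metrics without a second pass. A minimal sketch, with my_table again as a placeholder:

// Enable vacuum logging, then vacuum; the VACUUM END history entry
// records metrics such as the number of files deleted.
spark.conf.set("spark.databricks.delta.vacuum.logging.enabled", "true")
spark.sql("VACUUM my_table")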