03-06-2022 05:09 PM
Hi, I'm running some scheduled VACUUM jobs and would like to know how many files were deleted, without doing the whole computation twice (once with DRY RUN and once without). Is there a way to accomplish this?
Thanks!
03-07-2022 02:44 AM
SELECT * FROM (DESCRIBE HISTORY table) x WHERE operation IN ('VACUUM END', 'VACUUM START');
That gives the required information.
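If you only need the per-run file count, the operationMetrics map on the VACUUM END row carries it. A minimal Scala sketch, assuming a placeholder table name my_table (numDeletedFiles is the metric key Delta records for VACUUM END):

// Hypothetical table name "my_table"; swap in your own.
// Delta records numDeletedFiles in operationMetrics on the VACUUM END row.
val vacuumStats = spark.sql("""
  SELECT version,
         timestamp,
         operationMetrics['numDeletedFiles'] AS numDeletedFiles
  FROM (DESCRIBE HISTORY my_table) h
  WHERE h.operation = 'VACUUM END'
""")
vacuumStats.show(truncate = false)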
03-06-2022 09:51 PM
Hi @Alejandro Martinez:
I don't think there is any command that gives the statistics before and after a VACUUM; at least I haven't come across one.
If you want to capture more details, you could write a small routine to collect the statistics yourself, as below.
Before (capture the data file count and total size):

val files = dbutils.fs.ls("<Your Table Path>")  // list the table directory
val getDataFileCount = files.size               // number of entries found
var getDataFileSize = 0L                        // total bytes, accumulated below
files.foreach { file =>
  getDataFileSize += file.size
}
After:
Repeat the above and compare the before/after numbers.
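One caveat: dbutils.fs.ls only lists a single directory level, so for a partitioned table the counts above would miss files inside partition subfolders. A minimal recursive sketch (the fileStats helper name is mine, and the path is the same placeholder as above):

// Walk the table directory recursively, since dbutils.fs.ls is not recursive.
// Directory entries in the listing have names ending in "/".
def fileStats(path: String): (Long, Long) = {    // returns (file count, total bytes)
  dbutils.fs.ls(path).map { f =>
    if (f.name.endsWith("/")) fileStats(f.path)  // descend into subdirectory
    else (1L, f.size)                            // leaf file: count it and its size
  }.foldLeft((0L, 0L)) { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }
}

val (countBefore, sizeBefore) = fileStats("<Your Table Path>")
// ... run VACUUM here ...
val (countAfter, sizeAfter) = fileStats("<Your Table Path>")
println(s"Files removed: ${countBefore - countAfter}, bytes freed: ${sizeBefore - sizeAfter}")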
Let's see if other community members have better ideas on this.
03-07-2022 06:13 AM
Thank you! Not quite the solution I was looking for, but it seems nothing better exists yet, so I'm going with that.
Thanks!!!
03-07-2022 06:22 AM
We have to enable vacuum logging so that the VACUUM details are captured in the table history:
spark.conf.set("spark.databricks.delta.vacuum.logging.enabled", "true")
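Once that flag is set, each run writes VACUUM START and VACUUM END entries into the table history, so the DESCRIBE HISTORY query above picks up the metrics. A minimal end-to-end sketch, again with a placeholder my_table and an example retention window:

// Enable vacuum logging, run a vacuum, then read the metrics back from history.
spark.conf.set("spark.databricks.delta.vacuum.logging.enabled", "true")
spark.sql("VACUUM my_table RETAIN 168 HOURS")  // 7-day retention; adjust as needed
spark.sql("""
  SELECT version, operation, operationMetrics
  FROM (DESCRIBE HISTORY my_table) h
  WHERE h.operation LIKE 'VACUUM%'
""").show(truncate = false)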