yesterday
Hey Team,
I have a few questions regarding VACUUM and VACUUM LITE
1. How do they work internally, do both of them scan the entire table storage directory?
2. How should we use these in our prod jobs..I mean should we always run VACUUM LITE or VACUUM or a combination?
3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it and don't see anything in the delta history about it?
yesterday
Hey @analyticsnerd !
Below are the answers to your questions:
1. How do they work internally? Do both of them scan the entire table storage directory?
VACUUM OR (VACCUM FULL) does a full table directory listing to identify files to be deleted.
VACUUM LITE relies on the remove-file action from the delta log to identify files for deletion, so it doesn't have to scan the entire directory to identify files to be deleted. Also, this is incremental as it stores the version which it last read/referenced for vacuum inside _last_vacuum_info the directory within the delta log. Available from 16.1
2. How should we use these in our prod jobs...I mean, should we always run VACUUM LITE, VACUUM, or a combination?
Below are a few best practices:
3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it, and don't see anything in the delta history about it.
This is a great question, and I agree it's a pain point.I will work on this to include the type of VACUUM in the delta log.
But until then, there is a hack way to know which VACUUM has been run , by lokking at the Driver logs. Below are the example logs:
25/08/15 04:35:42 INFO VacuumCommand: Starting garbage collection (dryRun = true) of untracked files older than 15 Aug 2025 02:35:42 GMT in dbfs:/user/hive/warehouse/ak_db.db/delta_16_4_optimize
25/07/31 10:48:07 INFO VacuumCommand: Found 1076 files (731672207 bytes) and directories in a total of 1 directories that are safe to delete. Vacuum stats: DeltaVacuumStats(true,None,604800000,1753354083869,1,0,0,0,3795,0,1753958883805,1753958887875,4,4,4,false,0,0,1078,None,None,LITE)
25/07/31 10:59:46 INFO VacuumCommand: Deleted 1076 files (731672207 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1753959524951,1,1093,1076,731672207,51387,3939,1753959524900,1753959585658,4,4,4,false,0,0,1086,Some(0),Some(1086),LITE)
From the above, you can see the work LITE in the VACUUM stats, which tells it's a VACUUM lite we ran. But i still agree that we need to have it in the Delta history, which I will work on and get back soon with an update .
yesterday
Hey @analyticsnerd !
Below are the answers to your questions:
1. How do they work internally? Do both of them scan the entire table storage directory?
VACUUM OR (VACCUM FULL) does a full table directory listing to identify files to be deleted.
VACUUM LITE relies on the remove-file action from the delta log to identify files for deletion, so it doesn't have to scan the entire directory to identify files to be deleted. Also, this is incremental as it stores the version which it last read/referenced for vacuum inside _last_vacuum_info the directory within the delta log. Available from 16.1
2. How should we use these in our prod jobs...I mean, should we always run VACUUM LITE, VACUUM, or a combination?
Below are a few best practices:
3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it, and don't see anything in the delta history about it.
This is a great question, and I agree it's a pain point.I will work on this to include the type of VACUUM in the delta log.
But until then, there is a hack way to know which VACUUM has been run , by lokking at the Driver logs. Below are the example logs:
25/08/15 04:35:42 INFO VacuumCommand: Starting garbage collection (dryRun = true) of untracked files older than 15 Aug 2025 02:35:42 GMT in dbfs:/user/hive/warehouse/ak_db.db/delta_16_4_optimize
25/07/31 10:48:07 INFO VacuumCommand: Found 1076 files (731672207 bytes) and directories in a total of 1 directories that are safe to delete. Vacuum stats: DeltaVacuumStats(true,None,604800000,1753354083869,1,0,0,0,3795,0,1753958883805,1753958887875,4,4,4,false,0,0,1078,None,None,LITE)
25/07/31 10:59:46 INFO VacuumCommand: Deleted 1076 files (731672207 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1753959524951,1,1093,1076,731672207,51387,3939,1753959524900,1753959585658,4,4,4,false,0,0,1086,Some(0),Some(1086),LITE)
From the above, you can see the work LITE in the VACUUM stats, which tells it's a VACUUM lite we ran. But i still agree that we need to have it in the Delta history, which I will work on and get back soon with an update .
yesterday
@analyticsnerd , I have created a GitHub issue, and I will work on it later
Issue: https://github.com/delta-io/delta/issues/5586
yesterday
Hi @K_Anudeep
Thanks a lot for explaining it clearly!!
Also thanks a lot for owing it and creating a GitHub issue and also providing a workaround to classify the VACUUM type.
I hope we get flag seen in the delta history soon!!
3 hours ago
Thanks @K_Anudeep for the insights on VACUUM operations.
Passionate about hosting events and connecting people? Help us grow a vibrant local communityโsign up today to get started!
Sign Up Now