cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

VACUUM vs VACUUM LITE

analyticsnerd
Visitor

Hey Team,

I have a few questions regarding VACUUM and VACUUM LITE

1. How do they work internally, do both of them scan the entire table storage directory?

2. How should we use these in our prod jobs..I mean should we always run VACUUM LITE or VACUUM or a combination?

3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it and don't see anything in the delta history about it?

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

K_Anudeep
Databricks Employee
Databricks Employee

Hey @analyticsnerd !

Below are the answers to your questions:

1. How do they work internally? Do both of them scan the entire table storage directory?

VACUUM OR (VACCUM FULL) does a full table directory listing to identify files to be deleted.

VACUUM LITE relies on the remove-file action from the delta log to identify files for deletion, so it doesn't have to scan the entire directory to identify files to be deleted. Also, this is incremental as it stores the version which it last read/referenced for vacuum inside _last_vacuum_info the directory within the delta log. Available from 16.1

 

2. How should we use these in our prod jobs...I mean, should we always run VACUUM LITE, VACUUM, or a combination?

Below are a few best practices:

  • Run VACUUM daily/regularly on the tables to minimise the retention of unused files.
  • Avoid writing very small and too many files, which will cause high listing times whenever VACCUM is executed
  • Do not run VACUUM with RETENTION 0 hours
  • Also, these best practices as well before running a VACUUM. Link: https://docs.databricks.com/en/delta/vacuum.html#what-size-cluster-does-vacuum-need
  • Run VACUUM LITE daily/weekly and VACUUM once in a while(a month or quarter) as a best pactice

 

3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it, and don't see anything in the delta history about it.

This is a great question, and I agree it's a pain point.I will work on this to include the type of VACUUM in the delta log.

But until then, there is a hack way to know which VACUUM has been run ,  by lokking at the Driver logs. Below are the example logs:

25/08/15 04:35:42 INFO VacuumCommand: Starting garbage collection (dryRun = true) of untracked files older than 15 Aug 2025 02:35:42 GMT in dbfs:/user/hive/warehouse/ak_db.db/delta_16_4_optimize

25/07/31 10:48:07 INFO VacuumCommand: Found 1076 files (731672207 bytes) and directories in a total of 1 directories that are safe to delete. Vacuum stats: DeltaVacuumStats(true,None,604800000,1753354083869,1,0,0,0,3795,0,1753958883805,1753958887875,4,4,4,false,0,0,1078,None,None,LITE)

25/07/31 10:59:46 INFO VacuumCommand: Deleted 1076 files (731672207 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1753959524951,1,1093,1076,731672207,51387,3939,1753959524900,1753959585658,4,4,4,false,0,0,1086,Some(0),Some(1086),LITE)

From the above, you can see the work LITE in the VACUUM stats, which tells it's a VACUUM lite we ran. But i still agree that we need to have it in the Delta history, which I will work on and get back soon with an update .

Anudeep

View solution in original post

4 REPLIES 4

K_Anudeep
Databricks Employee
Databricks Employee

Hey @analyticsnerd !

Below are the answers to your questions:

1. How do they work internally? Do both of them scan the entire table storage directory?

VACUUM OR (VACCUM FULL) does a full table directory listing to identify files to be deleted.

VACUUM LITE relies on the remove-file action from the delta log to identify files for deletion, so it doesn't have to scan the entire directory to identify files to be deleted. Also, this is incremental as it stores the version which it last read/referenced for vacuum inside _last_vacuum_info the directory within the delta log. Available from 16.1

 

2. How should we use these in our prod jobs...I mean, should we always run VACUUM LITE, VACUUM, or a combination?

Below are a few best practices:

  • Run VACUUM daily/regularly on the tables to minimise the retention of unused files.
  • Avoid writing very small and too many files, which will cause high listing times whenever VACCUM is executed
  • Do not run VACUUM with RETENTION 0 hours
  • Also, these best practices as well before running a VACUUM. Link: https://docs.databricks.com/en/delta/vacuum.html#what-size-cluster-does-vacuum-need
  • Run VACUUM LITE daily/weekly and VACUUM once in a while(a month or quarter) as a best pactice

 

3. How can we differentiate which operation( whether FULL or LITE )was performed by on the table when VACUUM is run? I tried running it, and don't see anything in the delta history about it.

This is a great question, and I agree it's a pain point.I will work on this to include the type of VACUUM in the delta log.

But until then, there is a hack way to know which VACUUM has been run ,  by lokking at the Driver logs. Below are the example logs:

25/08/15 04:35:42 INFO VacuumCommand: Starting garbage collection (dryRun = true) of untracked files older than 15 Aug 2025 02:35:42 GMT in dbfs:/user/hive/warehouse/ak_db.db/delta_16_4_optimize

25/07/31 10:48:07 INFO VacuumCommand: Found 1076 files (731672207 bytes) and directories in a total of 1 directories that are safe to delete. Vacuum stats: DeltaVacuumStats(true,None,604800000,1753354083869,1,0,0,0,3795,0,1753958883805,1753958887875,4,4,4,false,0,0,1078,None,None,LITE)

25/07/31 10:59:46 INFO VacuumCommand: Deleted 1076 files (731672207 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1753959524951,1,1093,1076,731672207,51387,3939,1753959524900,1753959585658,4,4,4,false,0,0,1086,Some(0),Some(1086),LITE)

From the above, you can see the work LITE in the VACUUM stats, which tells it's a VACUUM lite we ran. But i still agree that we need to have it in the Delta history, which I will work on and get back soon with an update .

Anudeep

K_Anudeep
Databricks Employee
Databricks Employee

@analyticsnerd , I have created a GitHub issue, and I will work on it later 

Issuehttps://github.com/delta-io/delta/issues/5586

Anudeep

Hi @K_Anudeep 

Thanks a lot for explaining it clearly!!

Also thanks a lot for owing it and creating a GitHub issue and also providing a workaround to classify the VACUUM type.

I hope we get flag seen in the delta history soon!!

Raman_Unifeye
Contributor III

Thanks @K_Anudeep for the insights on VACUUM operations.


RG #Driving Business Outcomes with Data Intelligence