Data Engineering

Running vacuum on each table

alejandrofm
Valued Contributor

Hi, in line with my question about OPTIMIZE, this is the next step: with a retention of 7 days, I could execute VACUUM on all tables once a week. Is this a recommended procedure?

How can I know whether I'll get any benefit from VACUUM, other than DRY RUN, which runs the whole process but skips the final part where it deletes the files?

And does anyone have a script to share that iterates over each table and runs a command? I've found similar sample scripts, but nothing in Spark.

Thanks!

1 ACCEPTED SOLUTION


AmanSehgal
Honored Contributor III

Ideally, 7 days is recommended, but discuss with your data stakeholders to identify what's suitable: 7, 14, or 28 days.
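
For reference, a minimal sketch of what enforcing a 7-day retention looks like (the table name my_db.my_table is hypothetical; 168 hours corresponds to 7 days, Delta's default retention threshold):

%python
# Hypothetical table name; RETAIN 168 HOURS keeps the last 7 days of history
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")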

Before using VACUUM, first run some analytics on the behaviour of your data.

Identify the percentage of operations that perform updates and deletes versus insert operations. This will tell you how many files are no longer referenced, and whether a cleanup is required.
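
A minimal sketch of that analysis, assuming a Delta table (my_db.my_table is a hypothetical name), using DESCRIBE HISTORY to tally operation types:

%python
# DESCRIBE HISTORY lists past operations (WRITE, UPDATE, DELETE, MERGE, ...) on a Delta table
history = spark.sql("DESCRIBE HISTORY my_db.my_table")

# A high share of UPDATE/DELETE/MERGE operations means more unreferenced files to clean up
history.groupBy("operation").count().show()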

To understand the efficiency of VACUUM, run queries before and after the VACUUM operation. Does query performance improve? Try running VACUUM at different intervals: 7, 14, or 28 days.
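
One rough way to measure that, assuming a representative query of your own (the query below is a placeholder):

%python
import time

# Placeholder query; substitute a query that's representative of your workload
start = time.time()
spark.sql("SELECT count(*) FROM my_db.my_table").collect()
print(f"Elapsed: {time.time() - start:.2f} s")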

And definitely use VACUUM if there's a requirement to wipe data out of the system for particular scenarios.

There's no ready-made script as such, but you can try this:

%python
# Get the list of all tables in the given database
table_list = spark.sql("SHOW TABLES FROM <database name>").select("tableName").rdd.map(lambda x: x[0]).collect()

# Run VACUUM on each table
for table in table_list:
    spark.sql(f"VACUUM <database name>.{table}")


2 REPLIES

sapote
New Contributor III

Thanks for the elaborate explanation. 👍🏼
