Data Engineering

Running vacuum on each table

alejandrofm
Valued Contributor

Hi, in line with my question about OPTIMIZE, this is the next step. With a retention of 7 days, I could execute VACUUM on all tables once a week. Is this a recommended procedure?

How can I know whether I'll get any benefit from VACUUM without using DRY RUN, which runs the whole process but skips the final step where it deletes the files?

Also, does anyone have a script to share that iterates over each table and runs a command? I've found similar sample scripts, but nothing in Spark.

Thanks!

1 ACCEPTED SOLUTION

AmanSehgal
Honored Contributor III

Ideally, 7 days is the recommended retention, but discuss with your data stakeholders to identify what's suitable: 7, 14, or 28 days.
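For reference, the retention window can also be set explicitly per run rather than relying on the table default; a minimal sketch (the database and table names are placeholders):

%python
# Retain 7 days (168 hours) of history; use 336/672 hours for 14/28 days
spark.sql("VACUUM <database name>.<table name> RETAIN 168 HOURS")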

To use VACUUM effectively, first run some analytics on the behaviour of your data.

Identify the percentage of operations that perform updates and deletes versus insert operations. This will tell you how many files are no longer referenced and whether a cleanup is needed.
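As a sketch of that analysis, you can aggregate the operation types recorded in the Delta transaction log (the database and table names are placeholders):

%python
from pyspark.sql import functions as F

# DESCRIBE HISTORY returns one row per commit, with an 'operation' column
# (WRITE, MERGE, DELETE, UPDATE, OPTIMIZE, ...)
history = spark.sql("DESCRIBE HISTORY <database name>.<table name>")
history.groupBy("operation").count().orderBy(F.desc("count")).show()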

To understand the efficiency of VACUUM, run queries before and after the VACUUM operation. Does it improve query performance? You could also try running VACUUM at different intervals: 7, 14, or 28 days.
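A minimal timing sketch for that before/after comparison (the query shown is just an illustrative placeholder for one of your representative workloads):

%python
import time

# Run a representative query and record its wall-clock time;
# repeat the same measurement after VACUUM and compare
start = time.time()
spark.sql("SELECT * FROM <database name>.<table name> LIMIT 1000").collect()
print(f"Elapsed: {time.time() - start:.2f}s")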

And definitely use VACUUM if there's a requirement to permanently wipe data from the system in particular scenarios.
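If that wipe scenario requires removing files younger than the default 7-day threshold, Delta's safety check must be disabled first; a cautious sketch (only appropriate when no readers or writers depend on older table versions):

%python
# Disable the retention-duration safety check (assumption: you have
# confirmed no concurrent jobs rely on time travel to older versions)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM <database name>.<table name> RETAIN 0 HOURS")
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")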

There's no built-in script for this, but you can try something like the following:

%python
# Get the list of all tables in the database
table_list = spark.sql("SHOW TABLES FROM <database name>").select("tableName").rdd.map(lambda x: x[0]).collect()

# Run VACUUM on each table with the default retention
for table in table_list:
    spark.sql(f"VACUUM <database name>.{table}")


2 REPLIES

sapote
New Contributor II

Thanks for the elaborate explanation. 👍🏼
