Data Engineering

Running vacuum on each table

alejandrofm
Valued Contributor

Hi, in line with my question about OPTIMIZE, this is the next step: with a retention of 7 days, I could execute VACUUM on all tables once a week. Is this a recommended procedure?

How can I know whether I'll get any benefit from VACUUM, other than DRY RUN, which runs the whole process but skips the final part where it deletes the files?

And does anyone have a script to share that iterates over each table and runs a command? I've found similar sample scripts, but nothing in Spark.

Thanks!

1 ACCEPTED SOLUTION


AmanSehgal
Honored Contributor III

Ideally, 7 days is recommended, but discuss with your data stakeholders to identify what's suitable: 7, 14, or 28 days.
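
For reference, a minimal sketch of what enforcing a 7-day retention looks like (the table name my_db.my_table is hypothetical; 168 hours corresponds to 7 days, Delta's default retention threshold):

%python
# Hypothetical table name; RETAIN 168 HOURS keeps the last 7 days of history
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")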

Before using VACUUM, first run some analytics on the behaviour of your data.

Identify the percentage of operations that perform updates and deletes versus insert operations. This will tell you how many files are no longer referenced, and whether a cleanup is required.
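
A minimal sketch of that analysis, assuming a Delta table (my_db.my_table is a hypothetical name), using DESCRIBE HISTORY to tally operation types:

%python
# DESCRIBE HISTORY lists past operations (WRITE, UPDATE, DELETE, MERGE, ...) on a Delta table
history = spark.sql("DESCRIBE HISTORY my_db.my_table")

# A high share of UPDATE/DELETE/MERGE operations means more unreferenced files to clean up
history.groupBy("operation").count().show()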

To understand the efficiency of VACUUM, run queries before and after the VACUUM operation. Does query performance improve? Try running VACUUM at different intervals: 7, 14, or 28 days.
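
One rough way to measure that, assuming a representative query of your own (the query below is a placeholder):

%python
import time

# Placeholder query; substitute a query that's representative of your workload
start = time.time()
spark.sql("SELECT count(*) FROM my_db.my_table").collect()
print(f"Elapsed: {time.time() - start:.2f} s")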

And definitely use VACUUM if there's a requirement to wipe data out of the system for particular scenarios.

There's no ready-made script as such, but you can try this:

%python
# Get the list of all tables in the given database
table_list = spark.sql("SHOW TABLES FROM <database name>").select("tableName").rdd.map(lambda x: x[0]).collect()

# Run VACUUM on each table
for table in table_list:
    spark.sql(f"VACUUM <database name>.{table}")


2 REPLIES

sapote
New Contributor III

Thanks for the elaborate explanation. 👍🏼
