
Running vacuum on each table

alejandrofm — Sat, 12 Feb 2022 21:35:42 GMT

Hi, in line with my question about optimize, this is the next step. With a retention of 7 days, I could execute vacuum on all tables once a week. Is this a recommended procedure?

How can I know whether I'll get any benefit from vacuum, other than with DRY RUN, which runs the whole process but skips the final step where it deletes the files?

Also, does anyone have a script to share that iterates over each table and runs a command? I've found similar sample scripts, but nothing in Spark.

Thanks!

Re: Running vacuum on each table

AmanSehgal — Mon, 14 Feb 2022 13:22:57 GMT

Ideally 7 days is recommended, but discuss with your data stakeholders to identify what's suitable: 7, 14, or 28 days.

To use VACUUM, first run some analytics on the behaviour of your data.

Identify the percentage of operations that perform updates and deletes versus inserts. This will tell you roughly how many files are no longer referenced and whether a cleanup is required.
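
One rough way to estimate that percentage is from the table history. This is only a minimal sketch. It assumes the tables are Delta tables and that the operation names come from the `operation` column of `DESCRIBE HISTORY`. The function name `rewrite_share`, the table name in the comment, and the set of operations counted as rewrites are my own assumptions, so check them against what actually appears in your history.

```python
from collections import Counter

# Operation names assumed to rewrite or remove existing files.
# This set is an assumption; adjust it to the values in your own table history.
REWRITE_OPS = {"UPDATE", "DELETE", "MERGE", "OPTIMIZE"}


def rewrite_share(operations):
    """Return the fraction of history entries that rewrite or remove files."""
    counts = Counter(op.upper() for op in operations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rewrites = sum(n for op, n in counts.items() if op in REWRITE_OPS)
    return rewrites / total


# In a notebook the operations could be collected like this (hypothetical table name):
# rows = spark.sql("DESCRIBE HISTORY my_db.my_table").select("operation").collect()
# print(rewrite_share(r["operation"] for r in rows))
```

A share close to zero suggests the table is mostly appended to, so VACUUM probably has little to remove. Overwrite-style writes also leave unreferenced files and are not counted here.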

To understand the effect of VACUUM, run queries before and after the VACUUM operation. Does it improve query times? Maybe try running VACUUM at different intervals: 7, 14, or 28 days.

And definitely use VACUUM if there's a requirement to wipe data out of the system in particular scenarios.

There's no ready-made script as such, but you can try this:

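%python
# Database to process (replace the placeholder with your own)
database = "<database name>"

# Get the list of all tables in the database
table_list = spark.sql(f"show tables from {database}").select('tableName').rdd.map(lambda x: x[0]).collect()

# Run VACUUM on each table, qualified with the database name
for table in table_list:
  spark.sql(f"VACUUM {database}.{table}")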
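
A variant of the loop above that first previews what would be removed could look like the sketch below. It assumes Delta tables and the `VACUUM ... RETAIN n HOURS DRY RUN` form of the command. The helper name `vacuum_statements` and its parameters are hypothetical, and 168 hours is just the 7-day retention from the question.

```python
def vacuum_statements(database, tables, retain_hours=168, dry_run=True):
    """Build one VACUUM statement per table; 168 hours equals a 7-day retention."""
    suffix = " DRY RUN" if dry_run else ""
    return [
        f"VACUUM `{database}`.`{table}` RETAIN {retain_hours} HOURS{suffix}"
        for table in tables
    ]


# In a notebook (hypothetical database name, table_list as collected above):
# for stmt in vacuum_statements("my_db", table_list):
#     spark.sql(stmt).show(truncate=False)
```

With `dry_run=True` nothing is deleted, so the output can be used to judge whether a real run is worth scheduling. Passing `dry_run=False` builds the statements that actually remove files.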

Re: Running vacuum on each table

sapote — Mon, 14 Feb 2022 14:38:38 GMT

Thanks for the elaborate explanation. 👍🏼