02-12-2022 01:35 PM
Hi, in line with my question about OPTIMIZE, this is the next step. With a retention of 7 days, I could execute VACUUM on all tables once a week. Is this a recommended procedure?
How can I know whether I'll get any benefit from VACUUM, other than by using DRY RUN, which runs the whole process but skips the final step where it deletes the files?
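For reference, this is the dry run I mean (a minimal example; the table name is a placeholder):
%python
# Lists the files VACUUM would delete, without actually deleting them
spark.sql("VACUUM my_db.my_table DRY RUN").show(truncate=False)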
Also, does anyone have a script to share that iterates over each table and runs a command? I've found similar sample scripts, but nothing in Spark.
Thanks!
Accepted Solutions
02-14-2022 05:22 AM
Ideally, 7 days is recommended, but discuss with your data stakeholders to identify what's suitable: 7, 14, or 28 days.
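If you settle on a window, the retention can also be configured per table; a minimal sketch, assuming a Delta table named my_db.my_table:
%python
# Configure how long VACUUM keeps unreferenced files (this also bounds time travel);
# my_db.my_table and the interval are placeholders to adjust
spark.sql("""
  ALTER TABLE my_db.my_table SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")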
Before using VACUUM, first run some analytics on the behaviour of your data.
Identify the percentage of operations that perform updates and deletes versus inserts. This will tell you how many files are no longer referenced and whether a cleanup is required.
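One way to get that breakdown is from the table history; a rough sketch, assuming a Delta table named my_db.my_table:
%python
from pyspark.sql import functions as F

# Count table operations by type (WRITE, MERGE, UPDATE, DELETE, ...)
# to see how update/delete-heavy the table is; my_db.my_table is a placeholder
history = spark.sql("DESCRIBE HISTORY my_db.my_table")
history.groupBy("operation").agg(F.count("*").alias("count")).orderBy(F.desc("count")).show(truncate=False)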
To understand the efficiency of VACUUM, run queries before and after the VACUUM operation. Does it improve query time? You could also try running VACUUM at different intervals: 7, 14, or 28 days.
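A simple way to compare is to time the same representative query before and after VACUUM; a rough sketch (the query is a placeholder for your real workload):
%python
import time

# Wall-clock timing of a representative query; run once before and once after VACUUM
start = time.time()
spark.sql("SELECT count(*) FROM my_db.my_table").collect()
print(f"Query took {time.time() - start:.2f}s")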
And definitely use VACUUM if there's a requirement to permanently remove data from storage for particular scenarios.
There's no built-in script as such, but you can try this:
%python
# Get the list of all tables in the database, then run VACUUM on each one
# (assumes all tables in <database name> are Delta tables)
table_list = spark.sql("show tables from <database name>").select('tableName').rdd.map(lambda x: x[0]).collect()
for table in table_list:
    spark.sql(f"VACUUM <database name>.{table}")
02-14-2022 06:38 AM
Thanks for the elaborate explanation. 👍🏼

