Hi @OldManCoder, thanks for your question!
1) Yes, separating cleanup tasks into a dedicated workflow is often more efficient. Here's why:
- Performance: Vacuum and optimization are resource-intensive operations. Running them inline with your primary workflow can significantly increase runtime.
- Scheduling Flexibility: By decoupling cleanup, you can schedule it during off-peak hours to minimize the impact on compute resources and other workloads.
- Centralized Management: A separate workflow allows you to consolidate cleanup for multiple tables, making it easier to manage and monitor.
2) Not necessarily. Here’s when each operation is useful:
- Vacuum: Use only when you expect a significant amount of deleted or obsolete data. VACUUM removes data files that are no longer referenced by the table and are older than the retention period, so it usually doesn't need to run often (e.g., weekly or biweekly, depending on your data deletion patterns). For a process running every two weeks, a VACUUM at the end of the process, in a separate workflow, should suffice.
- Optimize: Use after significant write operations, especially if your queries involve heavy scans or filters. OPTIMIZE compacts many small files into fewer, larger ones, which speeds up reads, but it doesn't need to run after every write unless the table sees frequent updates or inserts (see the sketch after this list for a quick way to check whether compaction is worthwhile).
- Analyze Table: Use only when query performance depends on up-to-date table statistics. Delta Lake uses automated statistics collection for most operations, so explicit statistics computation might not be necessary unless you observe query degradation.
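If you want a quick signal for whether OPTIMIZE is worth running, DESCRIBE DETAIL reports the file count and total size, so you can estimate the average file size. This is just a minimal sketch; the table name and the 128 MB threshold are illustrative assumptions, not a Databricks recommendation:

# Inspect the file layout of one table (table name is an example)
detail = spark.sql("DESCRIBE DETAIL catalog.schema.table1").collect()[0]
num_files = detail["numFiles"]
size_bytes = detail["sizeInBytes"]
avg_file_mb = (size_bytes / num_files) / (1024 * 1024) if num_files else 0
# Lots of small files usually means compaction will help (threshold is an arbitrary example)
if num_files > 0 and avg_file_mb < 128:
    print(f"{num_files} files, avg {avg_file_mb:.1f} MB each -> OPTIMIZE is probably worthwhile")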
So you could create a separate notebook or workflow to handle maintenance tasks for all tables after your main process completes.
Frequency:
- Vacuum: Run periodically (e.g., biweekly or monthly) depending on the volume of deleted data. The table history, sketched after this list, shows when VACUUM and OPTIMIZE last ran.
- Optimize: Run after large write operations but not necessarily after every batch. For stable tables, consider weekly or monthly.
- Analyze: Run only if query performance suggests stale statistics.
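To see how recently maintenance actually ran, the Delta log records OPTIMIZE and VACUUM as table history entries. A minimal sketch (the table name is an example):

# List the most recent maintenance operations recorded for the table
history = spark.sql("DESCRIBE HISTORY catalog.schema.table1")
recent = history.filter(history.operation.isin("OPTIMIZE", "VACUUM END"))
recent.select("version", "timestamp", "operation").orderBy("version", ascending=False).show(5, truncate=False)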
Retention Period: Make sure the VACUUM retention period aligns with your time-travel and recovery needs. The default of 7 days (168 hours) is often sufficient unless you need a longer recovery window; you can also set it per table, as sketched below.
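If 7 days isn't right for a given table, you can pin the retention on the table itself so callers of VACUUM don't have to remember the right RETAIN value. A minimal sketch, assuming a 14-day window fits your recovery needs (the table name and interval are examples):

# Set the deleted-file retention at the table level (example: 14 days)
spark.sql("""
    ALTER TABLE catalog.schema.table1
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 14 days')
""")
# A plain VACUUM (no RETAIN clause) will then honor this table property
spark.sql("VACUUM catalog.schema.table1")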
The cleanup workflow could be something like:
tables_to_clean = [
    "catalog.schema.table1",
    "catalog.schema.table2",
]

for table in tables_to_clean:
    # Compact small files first; the files OPTIMIZE replaces become eligible for a later VACUUM
    spark.sql(f"OPTIMIZE {table}")
    # Remove files that are no longer referenced and older than the retention window (default 7 days)
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # adjust retention as needed
    # Only keep this step if you have actually observed query degradation from stale statistics
    spark.sql(f"ANALYZE TABLE {table} COMPUTE DELTA STATISTICS")
Hope this helps!