Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Should Vacuum Be Tied to Workflows?

OldManCoder
New Contributor II

I have a process expected to run every two weeks. Throughout the process (~30 notebooks), when I write to a table for the last time in the overall process, I run my vacuum as shown below; I never run a vacuum against the same table twice. I've noticed this cleanup process can add substantial time to my notebook run times. My questions:

1. Should I put all cleanups in a separate notebook and run them in a separate workflow?

2. Am I doing it the right way?

# databricksCatalog holds the catalog.schema prefix (including the trailing dot) and is updated whenever the target catalog/schema changes.

myTable = databricksCatalog + "my_table_name"
spark.sql(f"VACUUM {myTable} RETAIN 504 HOURS")
spark.sql(f"OPTIMIZE {myTable}")
spark.sql(f"ANALYZE TABLE {myTable} COMPUTE DELTA STATISTICS")

Do I need to vacuum/optimize/analyze each time?

1 ACCEPTED SOLUTION

VZLA
Databricks Employee

Hi @OldManCoder, thanks for your question!

1) Yes, separating cleanup tasks into a dedicated workflow is often more efficient. Here's why (a job-scheduling sketch follows this list):

  • Performance: Vacuum and optimization are resource-intensive operations. Running them inline with your primary workflow can significantly increase runtime.
  • Scheduling Flexibility: By decoupling cleanup, you can schedule it during off-peak hours to minimize the impact on compute resources and other workloads.
  • Centralized Management: A separate workflow allows you to consolidate cleanup for multiple tables, making it easier to manage and monitor.
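For instance, the dedicated maintenance job could be created and scheduled off-peak with the Databricks SDK. This is only a minimal sketch, assuming the databricks-sdk Python package is available; the job name, notebook path, cluster id, and cron expression are placeholders rather than values from this thread:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="table-maintenance",
    tasks=[
        jobs.Task(
            task_key="vacuum_optimize_analyze",
            # Placeholder path to a notebook holding the cleanup loop (like the one later in this reply).
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/maintenance/cleanup"),
            existing_cluster_id="<cluster-id>",  # or define a job cluster instead
        )
    ],
    # Quartz cron: Sundays at 02:00, i.e. an off-peak window.
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 2 ? * SUN", timezone_id="UTC"),
)

The same job could equally be defined through the Jobs UI or an asset bundle; the point is that the maintenance schedule becomes independent of the main pipeline.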

2) Not necessarily. Here’s when each operation is useful:

  • Vacuum: Use only when you expect significant deleted or obsolete data. Vacuum removes old files based on the retention period and is usually needed less frequently (e.g., weekly or biweekly, depending on your data deletion patterns). For a process running every two weeks, a vacuum at the end of the process in a separate workflow should suffice.
  • Optimize: Use after significant write operations, especially if your queries involve heavy scans or filters. This operation compacts small files but doesn't need to run after every write unless your table sees frequent updates or inserts (a small-file check is sketched after this list).
  • Analyze Table: Use only when query performance depends on up-to-date table statistics. Delta Lake uses automated statistics collection for most operations, so explicit statistics computation might not be necessary unless you observe query degradation.
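As a rough illustration of the "only when needed" idea for OPTIMIZE (a heuristic sketch, not something from this thread): DESCRIBE DETAIL on a Delta table reports numFiles and sizeInBytes, so compaction can be skipped when the average file size is already reasonable. The 64 MB threshold and the table name are illustrative assumptions:

def has_many_small_files(table: str, min_avg_file_mb: float = 64.0) -> bool:
    # DESCRIBE DETAIL returns a single row including numFiles and sizeInBytes for a Delta table.
    detail = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
    if not detail["numFiles"]:
        return False
    avg_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
    return avg_mb < min_avg_file_mb

table = "catalog.schema.my_table"  # illustrative name
if has_many_small_files(table):
    spark.sql(f"OPTIMIZE {table}")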

So you could create a separate notebook or workflow to handle maintenance tasks for all tables after your main process completes.

Frequency:
- Vacuum: Run periodically (e.g., biweekly or monthly) depending on the volume of deleted data.
- Optimize: Run after large write operations, but not necessarily after every batch. For stable tables, consider weekly or monthly.
- Analyze: Run only if query performance suggests stale statistics.

Retention Period: Ensure the VACUUM RETAIN period aligns with your recovery needs. The default of 7 days (168 hours) is often sufficient unless you need a longer recovery window; a retention-alignment sketch follows.
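A minimal retention-alignment sketch, assuming you want to keep the 21-day recovery window implied by RETAIN 504 HOURS in the original code (the table name is illustrative). The delta.deletedFileRetentionDuration table property sets the default window VACUUM honors, so keeping it in sync with the RETAIN value keeps recovery and time-travel expectations consistent:

table = "catalog.schema.my_table"  # illustrative name
spark.sql(f"""
    ALTER TABLE {table}
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 21 days')
""")
spark.sql(f"VACUUM {table} RETAIN 504 HOURS")  # 504 hours = 21 days, matching the property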

The cleanup workflow could be something like:

tables_to_clean = [
    "catalog.schema.table1",
    "catalog.schema.table2"
]

for table in tables_to_clean:
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # Adjust retention as needed
    spark.sql(f"OPTIMIZE {table}")
    spark.sql(f"ANALYZE TABLE {table} COMPUTE DELTA STATISTICS")

Hope it helps!


2 REPLIES

Walter_C
Databricks Employee

Yes, it is recommended to run cleanup operations such as VACUUM, OPTIMIZE, and ANALYZE TABLE in a separate notebook and workflow. This isolates the cleanup tasks from the main data processing tasks, reducing the runtime of your primary notebooks, and it lets you schedule maintenance during off-peak hours to minimize the impact on other workloads.

You do not need to run these commands every time you write to a table. Instead, schedule them periodically based on your data update frequency and workload requirements; for example, skip OPTIMIZE when a table was already compacted recently, as sketched below.
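For example, a recency check against the Delta transaction history could decide whether compaction is due. This is only a sketch: it relies on OPTIMIZE runs appearing as operation = 'OPTIMIZE' in DESCRIBE HISTORY, and the 14-day window and table name are illustrative:

def optimized_recently(table: str, days: int = 14) -> bool:
    # DESCRIBE HISTORY exposes 'operation' and 'timestamp' columns for Delta tables.
    history = spark.sql(f"DESCRIBE HISTORY {table}")
    recent = history.filter(
        f"operation = 'OPTIMIZE' AND timestamp >= current_timestamp() - INTERVAL {days} DAYS"
    )
    return recent.limit(1).count() > 0

table = "catalog.schema.my_table"  # illustrative name
if not optimized_recently(table):
    spark.sql(f"OPTIMIZE {table}")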

