Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Machine Type for VACUUM operation

NOOR_BASHASHAIK
Contributor

Dear all,

I have a workflow with 2 tasks: one that does OPTIMIZE, followed by one that does VACUUM. I used a cluster with an F32s driver and 8 F64s workers (auto-scaling enabled). Databricks launches all 8 workers as soon as OPTIMIZE starts.
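The two tasks boil down to the following (a minimal sketch; the table name is a placeholder):

```python
# `spark` is the ambient SparkSession in a Databricks notebook;
# the three-level table name is a placeholder.
spark.sql("OPTIMIZE my_catalog.my_schema.my_table")  # Task 1: compute-heavy compaction
spark.sql("VACUUM my_catalog.my_schema.my_table")    # Task 2: list and delete unreferenced files
```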

As per the documentation, we should use F-series machines for OPTIMIZE and VACUUM operations, as they are compute intensive. But when I use the F series, CPU is barely used for the entire duration of the VACUUM step, on both the driver and the workers. On the driver it is a little higher, around 30%, since the driver, I think, performs the actual deletion of files. In contrast, memory usage touches 50% on both the driver and worker nodes.
You can see from the screenshot below (captured for one of the workers, but the pattern is the same for the rest) that CPU usage suddenly drops while memory remains in decent use; this is the point where VACUUM starts. During the OPTIMIZE step, CPU and memory are well used on both the worker and driver nodes. I expected Databricks to scale down once VACUUM started, since the hardware is not fully used, but perhaps it does not because memory is still well used (only the CPU is idle).
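One setting that looks relevant is Delta Lake's parallel delete for VACUUM, which moves the delete phase from the driver onto the workers; a minimal sketch (the table name is a placeholder, and I have not verified whether it changes the pattern above):

```python
# Documented Delta Lake setting: run VACUUM's delete phase in parallel
# on the workers instead of sequentially on the driver.
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")
spark.sql("VACUUM my_catalog.my_schema.my_table")  # placeholder table name
```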
 
Please advise the best set-up here.
[Screenshot: worker CPU and memory utilization; CPU drops while memory stays elevated once VACUUM starts]

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @NOOR_BASHASHAIK, when optimizing and vacuuming Delta tables in Databricks, the order of operations matters. Let’s explore the best approach for your workflow:

  1. Optimize:

    • The OPTIMIZE command compacts the small files in a Delta table into larger ones, reducing the file count. It’s a compute-intensive operation.
    • By running OPTIMIZE first, you ensure that the data is compacted, which can improve query performance.
    • The default target size for optimized files is 1 GB.
    • After optimization, the old pre-compaction files still exist in storage; they are merely marked as removed in the transaction log.
  2. Vacuum:

    • The VACUUM command removes data files that are no longer referenced by the table, including those made obsolete by OPTIMIZE.
    • It’s essential to run VACUUM after OPTIMIZE to clean up these obsolete files (see the sketch after this list).
    • The default retention threshold for VACUUM is 7 days.
    • With predictive optimization enabled on Unity Catalog managed tables, Databricks can run VACUUM automatically; otherwise you need to schedule it yourself.
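A minimal sketch of the two commands in that order (the table name is a placeholder, and 168 hours simply spells out the 7-day default):

```python
# 1. Compact small files into larger ones (default target: ~1 GB per file).
spark.sql("OPTIMIZE my_catalog.my_schema.my_table")

# 2. Remove files no longer referenced by the table and older than the
#    retention threshold (168 hours = the 7-day default).
spark.sql("VACUUM my_catalog.my_schema.my_table RETAIN 168 HOURS")
```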

Recommended Approach:

  • Run OPTIMIZE first, compacting the files.
  • Then, follow it with VACUUM to remove unnecessary files.
  • This order ensures that the optimized data is available first and the cleanup happens afterwards (a job-definition sketch follows this list).
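As an illustration of wiring that ordering into a Databricks job, here is a hedged sketch of a Jobs API 2.1 create-job payload in which the VACUUM task depends on the OPTIMIZE task; the job name, notebook paths, runtime version, and node type are placeholders:

```python
# Sketch of a two-task job: "vacuum" runs only after "optimize" succeeds.
job_spec = {
    "name": "optimize-then-vacuum",  # placeholder name
    "tasks": [
        {
            "task_key": "optimize",
            "notebook_task": {"notebook_path": "/Jobs/optimize_table"},  # placeholder path
            "job_cluster_key": "maintenance",
        },
        {
            "task_key": "vacuum",
            "depends_on": [{"task_key": "optimize"}],  # enforces the ordering
            "notebook_task": {"notebook_path": "/Jobs/vacuum_table"},  # placeholder path
            "job_cluster_key": "maintenance",
        },
    ],
    "job_clusters": [
        {
            "job_cluster_key": "maintenance",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # placeholder runtime
                "node_type_id": "Standard_F64s_v2",   # matches the F64s workers above
                "autoscale": {"min_workers": 1, "max_workers": 8},
            },
        }
    ],
}
```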

Regarding your hardware setup:

 

NOOR_BASHASHAIK
Contributor

Hi @Kaniz_Fatma, I request you to read my post carefully once again to better understand my problem statement; maybe that will lead to a more meaningful discussion beneficial for all. I already said I do VACUUM after OPTIMIZE. I already said I use the F series. I already said I use F64 machines for the workers, with 8 workers in auto-scaling mode.

 

ArturOA
New Contributor II

Hi,

Were you able to get any useful help on this?
