Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Machine Type for VACUUM operation

NOOR_BASHASHAIK
Contributor

Dear all

I have a workflow with two tasks: one that runs OPTIMIZE, followed by one that runs VACUUM. I use a cluster with an F32s driver and F64s workers (up to 8, with auto-scaling enabled). Databricks launches all 8 workers as soon as OPTIMIZE starts.
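
To make the setup concrete, here is a rough sketch of such a job as a Python dict shaped like a Databricks Jobs API 2.1 payload. Several details are assumptions rather than the actual configuration: the notebook paths, the cluster key, the job name, the runtime version, the minimum worker count, and the exact Azure SKU names (`Standard_F32s_v2`, `Standard_F64s_v2`) standing in for "F32s" and "F64s".

```python
import json


def build_maintenance_job(table_name: str, max_workers: int = 8) -> dict:
    """Return a job spec with an OPTIMIZE task followed by a VACUUM task."""
    cluster_key = "maintenance_cluster"  # hypothetical cluster key

    # One shared job cluster: F-series driver and autoscaling F-series workers.
    job_cluster = {
        "job_cluster_key": cluster_key,
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",  # assumed runtime version
            "driver_node_type_id": "Standard_F32s_v2",  # assumed SKU for "F32s"
            "node_type_id": "Standard_F64s_v2",  # assumed SKU for "F64s"
            # min_workers is an assumption; the post only states 8 workers.
            "autoscale": {"min_workers": 1, "max_workers": max_workers},
        },
    }

    def notebook_task(task_key: str, notebook_path: str) -> dict:
        # Each task runs a (hypothetical) notebook that issues one SQL command.
        return {
            "task_key": task_key,
            "job_cluster_key": cluster_key,
            "notebook_task": {
                "notebook_path": notebook_path,
                "base_parameters": {"table": table_name},
            },
        }

    optimize = notebook_task("optimize", "/Maintenance/optimize")
    vacuum = notebook_task("vacuum", "/Maintenance/vacuum")
    # VACUUM starts only after OPTIMIZE has finished.
    vacuum["depends_on"] = [{"task_key": "optimize"}]

    return {
        "name": "delta_maintenance",  # hypothetical job name
        "job_clusters": [job_cluster],
        "tasks": [optimize, vacuum],
    }


if __name__ == "__main__":
    job = build_maintenance_job("my_catalog.my_schema.my_table")
    print(json.dumps(job, indent=2))
```

Because both tasks point at the same `job_cluster_key`, the VACUUM task runs on whatever hardware was sized for OPTIMIZE.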

 

As per the documentation, we should use F-series machines for OPTIMIZE and VACUUM because these operations are compute-intensive. But when I use F-series machines, CPU is barely used on either the driver or the workers for the entire duration of the VACUUM step. On the driver it is a little higher, around 30%, which I think is because the driver performs the actual file deletes. In contrast, memory usage reaches about 50% on both the driver and worker nodes.
You can see in the screenshot below (captured for one of the workers, but the pattern is the same for the rest) that CPU usage drops suddenly while memory is still used to a decent extent. This is when VACUUM starts. During the OPTIMIZE step, CPU and memory are well used on both the worker and driver nodes. I expected Databricks to scale down once VACUUM started, since the hardware is not fully used, but maybe it does not because memory is still well used (only the CPU is idle).

Please advise the best set-up here.
[Screenshot: NOOR_BASHASHAIK_0-1710268182562.png]

2 REPLIES

Hi @Retired_mod, I request that you read my post carefully once again to better understand my problem statement; maybe that will lead to a more meaningful discussion that benefits everyone. I already said that I run VACUUM after OPTIMIZE, that I use F-series machines, and that I use F64 machines for the workers, with 8 workers in auto-scale mode.

ArturOA
New Contributor II

Hi,

Were you able to get any useful help on this?
