Databricks Community

NOOR_BASHASHAIK · 03-12-2024

Dear all

I have a workflow with two tasks: one that runs OPTIMIZE, followed by one that runs VACUUM. The cluster uses an F32s driver and 8 F64s workers (auto-scaling enabled). Databricks launches all 8 workers as soon as OPTIMIZE starts.
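
For readers following along, a two-task maintenance workflow of this kind usually boils down to something like the sketch below. This is only an illustration under assumptions: the helper names `maintenance_statements` and `run_maintenance`, the table name `my_schema.my_table`, and the 168-hour retention window are all placeholders and not taken from the actual job.

```python
from typing import List


def maintenance_statements(table_name: str, retain_hours: int = 168) -> List[str]:
    """Build the SQL for the two tasks, in the order the workflow runs them.

    table_name and retain_hours are illustrative placeholders (assumptions).
    """
    return [
        # Task 1: compact small files into larger ones.
        f"OPTIMIZE {table_name}",
        # Task 2: delete data files no longer referenced by the table
        # and older than the retention window.
        f"VACUUM {table_name} RETAIN {retain_hours} HOURS",
    ]


def run_maintenance(spark, table_name: str, retain_hours: int = 168) -> None:
    """Run OPTIMIZE, then VACUUM, on an existing SparkSession (hypothetical helper)."""
    for statement in maintenance_statements(table_name, retain_hours):
        spark.sql(statement)


if __name__ == "__main__":
    # Print the statements without needing a cluster.
    for statement in maintenance_statements("my_schema.my_table"):
        print(statement)
```

In a Databricks notebook or job task, `run_maintenance(spark, "my_schema.my_table")` would issue both statements one after the other. Splitting them into two tasks, as in the workflow described here, just means each task runs one of the two statements.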

As per the documentation, we should use F-series machines for OPTIMIZE and VACUUM because they are compute-intensive operations. But when I use F series, CPU is barely used for the whole duration of the VACUUM step, on both the driver and the workers. On the driver it is a little higher, around 30%, I think because the driver does the actual deletion of files. In contrast, memory usage reaches about 50% on both the driver and the worker nodes.

You can see from the screenshot below (captured for one of the workers, but the pattern is the same for the rest) that CPU usage suddenly drops while memory is still used to a decent extent. This is when VACUUM starts. During the OPTIMIZE step, CPU and memory are well used on both the worker and driver nodes. I expected Databricks to scale down when VACUUM started, since the hardware is not fully used, but maybe it does not because memory is still well used (only CPU is idle)....

Please advise on the best set-up here.

NOOR_BASHASHAIK · 03-13-2024

Hi @Retired_mod, I request you to read my post carefully once again to better understand my problem statement; maybe that will lead to a more meaningful discussion that benefits everyone. I already said that I run VACUUM after OPTIMIZE, that I use F series, and that I use F64s machines for the workers, with 8 workers in auto-scale mode.