Dear all,
I have a workflow with two tasks: one that runs OPTIMIZE, followed by one that runs VACUUM. I used a cluster with an F32s driver and F64s workers (8 workers, auto-scaling enabled). Databricks launches all 8 workers as soon as OPTIMIZE starts.
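For reference, the two tasks boil down to something like this (`my_db.my_table` is a placeholder for the actual table):

```python
# Task 1: compact small files into larger ones (compute-intensive)
spark.sql("OPTIMIZE my_db.my_table")

# Task 2: delete files no longer referenced by the Delta log
# (subject to the default 7-day retention threshold)
spark.sql("VACUUM my_db.my_table")
```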
As per the documentation, we should use F-series machines for OPTIMIZE and VACUUM operations since they are compute-intensive. But when I use the F series, CPU is barely used for the entire duration of the VACUUM step, on both the driver and the workers. On the driver it is a bit higher, around 30%, presumably because the driver performs the actual file deletions. In contrast, memory usage reaches about 50% on both the driver and the worker nodes.
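If it helps, my understanding is that the delete phase of VACUUM runs on the driver by default, which would match this pattern. There is a Delta setting that is supposed to spread the deletes across the workers instead; I have not verified whether it changes the utilization picture here:

```python
# Supposedly parallelizes VACUUM's file deletes across the workers
# instead of running them serially on the driver (untested on my side)
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")
spark.sql("VACUUM my_db.my_table")  # my_db.my_table is a placeholder
```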
You can see from the screenshot below (captured for one of the workers, but the same pattern holds for the rest) that CPU usage drops suddenly while memory remains moderately used; this is the point where VACUUM starts. During the OPTIMIZE step, both CPU and memory are well utilized on the worker and driver nodes. I expected Databricks to scale the cluster down once VACUUM started, since the hardware is underutilized, but perhaps it does not because memory is still in use (only the CPU is idle).
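For completeness, the job cluster spec looks roughly like this (the exact VM SKUs, runtime version, and min_workers are illustrative guesses; only the driver/worker sizes and the maximum of 8 workers are as described above):

```python
# Approximate job-cluster configuration (Jobs API style);
# SKU names, runtime version, and min_workers are illustrative
new_cluster = {
    "spark_version": "13.3.x-scala2.12",      # illustrative runtime version
    "driver_node_type_id": "Standard_F32s_v2",
    "node_type_id": "Standard_F64s_v2",
    "autoscale": {"min_workers": 1, "max_workers": 8},
}
```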
Please advise on the best setup here.