Machine type for different operations in Azure Databricks

noorbasha534
Valued Contributor II

Dear all

Do we have a general recommendation for the virtual machine type to use for different operations in Azure Databricks? We are looking at the following:

1. VACUUM
2. OPTIMIZE
3. ANALYZE STATS
4. DESCRIBE TABLE HISTORY

From the documentation, I understood at a high level that VACUUM first lists the files, which is a CPU-intensive operation, so a compute-optimized type such as the F series is advised.

I would appreciate a recommendation with some rationale. Thanks.

szymon_dybczak
Esteemed Contributor III

Hi @noorbasha534 ,

Here's the general recommendation from Databricks: they recommend running OPTIMIZE on compute-optimized VMs and VACUUM on general-purpose VMs.

Comprehensive Guide to Optimize Data Workloads | Databricks

But as you said, VACUUM is a compute-intensive operation, so running it on the F series is also a good approach. They even recommend that type of compute in the article below:

[Screenshot of the recommended compute type, from the article linked below]

VACUUM best practices on Delta Lake - Databricks
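
To make the F-series suggestion a bit more concrete, here is a minimal sketch of what a job cluster definition for a scheduled VACUUM could look like. The field names follow the `new_cluster` object of the Databricks Clusters/Jobs API. Everything else is my assumption for illustration: the helper name `vacuum_cluster_spec`, the node type `Standard_F8s_v2`, the runtime version string, and the autoscale range. Adjust them to your workspace and table sizes.

```python
# Hypothetical helper: builds a "new_cluster" dict for a VACUUM job.
# Field names come from the Databricks Clusters/Jobs API; the values
# (node type, runtime version, worker counts) are assumptions, not
# official recommendations.

def vacuum_cluster_spec(node_type="Standard_F8s_v2", min_workers=1, max_workers=4):
    """Return a cluster spec that uses a compute-optimized (F-series) node type."""
    if min_workers < 1 or max_workers < min_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "spark_version": "15.4.x-scala2.12",  # assumed runtime, pick your own
        "node_type_id": node_type,            # workers
        "driver_node_type_id": node_type,     # driver, same family here
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


spec = vacuum_cluster_spec()
print(spec["node_type_id"], spec["autoscale"])
```

The same dict can be pasted (as JSON) into a job definition, so the maintenance job gets its own cluster type instead of sharing the one used by your regular workloads.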

As for ANALYZE, it collects metadata about the data, so it's primarily I/O bound. General-purpose compute should be a good fit here, in my opinion.
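
To summarize the above, here is a small sketch that maps each maintenance statement to the VM category suggested in this reply. The function name `recommend` and the table name in the usage lines are hypothetical, and the mapping only reflects what is said in this thread, so DESCRIBE HISTORY (which I did not cover) returns `None`.

```python
# Recommendations as stated in this reply, keyed by the first SQL keyword.
# On Azure, "compute optimized" corresponds to the F series and
# "general purpose" to the D series.
RECOMMENDED = {
    "OPTIMIZE": "compute optimized",
    "VACUUM": "general purpose or compute optimized",
    "ANALYZE": "general purpose",
}


def recommend(statement):
    """Return the suggested VM category for a SQL statement, or None if not covered."""
    words = statement.split()
    if not words:
        return None
    return RECOMMENDED.get(words[0].upper())


# "sales.orders" is a made-up table name used only for illustration.
for sql in ("OPTIMIZE sales.orders",
            "VACUUM sales.orders",
            "ANALYZE TABLE sales.orders COMPUTE STATISTICS",
            "DESCRIBE HISTORY sales.orders"):
    print(sql, "->", recommend(sql))
```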