
Machine type for different operations in Azure Databricks

noorbasha534
Valued Contributor II

Dear all

Is there a general recommendation for which virtual machine type to use for different operations in Azure Databricks? We are interested in the following:

1. VACUUM
2. OPTIMIZE
3. ANALYZE STATS
4. DESCRIBE TABLE HISTORY

From the documentation, I understand at a high level that because VACUUM lists the files first, which is a CPU-intensive operation, F-series VMs are advised for it.

I would appreciate a recommendation with some rationale. Thanks.

1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @noorbasha534 ,

Here's a general recommendation from Databricks: run OPTIMIZE on compute-optimized VMs and VACUUM on general-purpose VMs.

Comprehensive Guide to Optimize Data Workloads | Databricks

But as you said, VACUUM is a compute-intensive operation, so running it on F-series VMs is also a good approach. Databricks even recommends that type of compute in the article below:

[screenshot: szymon_dybczak_0-1753707124150.png]

 


VACUUM best practices on Delta Lake - Databricks

As for ANALYZE, it collects metadata (statistics) about the data, so it's primarily I/O bound. General-purpose compute should be a good fit here, in my opinion.
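To make the mapping above concrete, here is a minimal Python sketch that picks a node type per maintenance operation and builds a cluster spec from it. Treat all of it as an assumption rather than an official recommendation: the helper name `job_cluster_spec`, the specific VM sizes (`Standard_F8s_v2` for compute optimized, `Standard_D8s_v3` for general purpose), the runtime version string, and the worker count are placeholders I picked for illustration. DESCRIBE HISTORY is left out because nothing in this thread gives a recommendation for it.

```python
# Hypothetical mapping of Delta maintenance operations to Azure VM sizes.
# The sizes are illustrative picks, not values taken from the Databricks docs.
NODE_TYPE_BY_OPERATION = {
    "OPTIMIZE": "Standard_F8s_v2",  # compute optimized (F-series)
    "VACUUM": "Standard_F8s_v2",    # F-series; general purpose is the other option mentioned
    "ANALYZE": "Standard_D8s_v3",   # general purpose
}


def job_cluster_spec(operation: str, num_workers: int = 2) -> dict:
    """Build a cluster spec dict for the given maintenance operation.

    The keys follow the usual cluster definition fields
    (spark_version, node_type_id, num_workers); the values are assumptions.
    """
    op = operation.strip().upper()
    if op not in NODE_TYPE_BY_OPERATION:
        raise ValueError(f"No VM recommendation recorded for: {operation!r}")
    return {
        "spark_version": "15.4.x-scala2.12",  # assumed runtime; use whatever you run
        "node_type_id": NODE_TYPE_BY_OPERATION[op],
        "num_workers": num_workers,
    }


if __name__ == "__main__":
    for op in ("OPTIMIZE", "VACUUM", "ANALYZE"):
        print(op, job_cluster_spec(op))
```

The idea is to keep one small job cluster definition per maintenance task, so each task can be sized and benchmarked on its own instead of sharing a single cluster type.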

 

 
