Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

saicharandeepb
New Contributor III

Hi everyone!

I recently designed a decision tree model to help recommend the most suitable VM types for different kinds of workloads in Databricks.

[Attached image: decision tree for recommending VM types by workload]

 

Thought Process Behind the Design:
Determining the optimal virtual machine (VM) for a workload is heavily dependent on:

  • The type of operations being performed (compute-heavy, memory-intensive, or storage-heavy)
  • The size of the data being handled
  • And of course, cost considerations

Based on this flow, users can take a trial-and-error approach while monitoring Spark metrics to validate whether the current VM type or worker configuration is optimal.
If the metrics indicate CPU, memory, or disk bottlenecks, the VM size or type can be adjusted to better suit the workload.

Moreover, if Spark metrics show that both CPU and memory utilization stay consistently below 50%, switching to general-purpose compute VMs is recommended to reduce cost and avoid over-provisioning.
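
To make the flow above concrete, here is a minimal sketch in Python of how this decision logic could be encoded. The "below 50% utilization" rule comes from the post itself, but the data-size cutoff and the VM family labels are illustrative assumptions, not part of the original tree:

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    kind: str              # "compute", "memory", "storage", or "mixed"
    data_size_gb: float    # approximate size of the data being processed
    avg_cpu_util: float    # average CPU utilization (0.0 - 1.0) from Spark metrics
    avg_mem_util: float    # average memory utilization (0.0 - 1.0) from Spark metrics

def recommend_vm_family(w: WorkloadProfile, large_data_gb: float = 500.0) -> str:
    """Return an illustrative VM family for a workload.

    The 50% rule mirrors the post: if both CPU and memory stay
    consistently below 50%, fall back to general-purpose compute.
    """
    if w.avg_cpu_util < 0.5 and w.avg_mem_util < 0.5:
        return "general-purpose"          # avoid over-provisioning
    if w.kind == "compute":
        return "compute-optimized (larger size)" if w.data_size_gb > large_data_gb else "compute-optimized"
    if w.kind == "memory":
        return "memory-optimized (larger size)" if w.data_size_gb > large_data_gb else "memory-optimized"
    if w.kind == "storage":
        return "storage-optimized"
    return "general-purpose"              # mixed / other workloads

# Example: a memory-heavy job on ~200 GB of data with healthy utilization
print(recommend_vm_family(WorkloadProfile("memory", 200, 0.7, 0.85)))
```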

I’d love feedback from the community on:

  • How can this decision tree be further evolved or refined?
  • What would be the best way to incorporate recommendations for general-purpose VMs directly into this flow?

Your insights will help make this decision tree more dynamic and practical for real-world Databricks workloads!

Thanks in advance for your thoughts and suggestions!

5 REPLIES

ManojkMohan
Honored Contributor

@saicharandeepb 

I would add a Cost vs. Performance prioritization node, capture mixed workloads explicitly, and include decision branches that suggest switching to general-purpose VMs when utilization metrics consistently stay low.
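
As a rough illustration of how that low-utilization branch could be driven by data, here is a sketch that reads node-level utilization from the `system.compute.node_timeline` system table in a Databricks notebook (where `spark` and `display` are predefined). The table and column names are assumptions to verify against your workspace, and the 50% threshold and 14-day window are arbitrary:

```python
# Sketch: flag clusters whose average CPU and memory utilization stay below 50%.
# Assumes the system.compute.node_timeline table with cpu_user_percent,
# cpu_system_percent, and mem_used_percent columns -- verify in your workspace.
low_util_clusters = spark.sql("""
    SELECT
        cluster_id,
        AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_pct,
        AVG(mem_used_percent)                      AS avg_mem_pct
    FROM system.compute.node_timeline
    WHERE start_time >= current_date() - INTERVAL 14 DAYS
    GROUP BY cluster_id
    HAVING AVG(cpu_user_percent + cpu_system_percent) < 50
       AND AVG(mem_used_percent) < 50
""")

# Candidates for a switch to general-purpose VMs
display(low_util_clusters)
```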

If you can share an editable version of your decision tree, I will try color-coding the delta; it seems like a fun learning exercise.


 

saicharandeepb
New Contributor III

Hi @ManojkMohan 
I’m sharing the editable version of my decision tree as requested — please feel free to make your color-coded enhancements and share it back once you’re done. I’d love to see your take on it! 😊
Here’s the link to the editable decision-tree file:
Editable Decision Tree 

Looking forward to your updated version.

Best regards,
Charan

ManojkMohan
Honored Contributor

[Attached image: updated decision tree with improvement areas marked in green]


Improvement areas marked in green:

  • The updated process starts with a clear separation of workload types: compute-heavy, memory-intensive, storage-heavy, and mixed/other.
  • Instead of generic VM types, the new tree differentiates recommendations by whether the data size is above or below a defined threshold.
  • A new "Is high IOPS needed?" node: if yes, storage-optimized VMs are recommended; if no, general-purpose VMs are preferred.
  • Cost/performance becomes a priority decision node, with three branches: cost-minimizing, performance-maximizing, or balanced.
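
Here is a rough sketch of how those additions could extend the earlier decision function. The IOPS flag, the data-size cutoff, and the three priority tiers are illustrative assumptions rather than exact thresholds from the updated tree:

```python
def recommend_vm_family_v2(kind: str,
                           data_size_gb: float,
                           needs_high_iops: bool = False,
                           priority: str = "balanced",   # "cost" | "performance" | "balanced"
                           large_data_gb: float = 500.0) -> str:
    """Extended sketch: adds the IOPS check and the cost/performance priority node."""
    # Storage-heavy path: branch on IOPS requirements
    if kind == "storage":
        return "storage-optimized" if needs_high_iops else "general-purpose"

    # Mixed / other workloads default to general-purpose
    if kind not in ("compute", "memory"):
        return "general-purpose"

    base = "compute-optimized" if kind == "compute" else "memory-optimized"
    size = "large" if data_size_gb > large_data_gb else "standard"

    # Cost/performance priority node with three branches
    if priority == "cost":
        return f"{base} ({size}, fewer nodes, spot instances if acceptable)"
    if priority == "performance":
        return f"{base} ({size}, latest generation, premium storage)"
    return f"{base} ({size})"   # balanced

# Example: a storage-heavy job that needs high IOPS
print(recommend_vm_family_v2("storage", 1200, needs_high_iops=True))
```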

jameswood32
New Contributor III

Your decision tree idea sounds solid! To improve it, consider including additional factors like network bandwidth, storage IOPS, and workload burst patterns. Also, think about cost-performance trade-offs and potential scaling requirements. Validating the tree with historical workload data or small pilot deployments can help fine-tune recommendations. Finally, keep it flexible so you can update VM options as cloud providers release new instance types.

James Wood
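
James's point about validating the tree against historical workload data could look something like the sketch below, reusing `recommend_vm_family` and `WorkloadProfile` from the earlier example. The `history` DataFrame and its columns are hypothetical placeholders for whatever per-job metrics you already collect:

```python
import pandas as pd

# Hypothetical historical metrics, one row per job run
history = pd.DataFrame([
    {"job": "daily_etl",   "kind": "memory",  "data_size_gb": 300, "cpu": 0.35, "mem": 0.40, "vm_used": "memory-optimized"},
    {"job": "ml_features", "kind": "compute", "data_size_gb": 800, "cpu": 0.85, "mem": 0.60, "vm_used": "compute-optimized"},
])

# Compare the tree's recommendation with what was actually provisioned;
# mismatches are candidates for either right-sizing or refining the tree.
history["recommended"] = history.apply(
    lambda r: recommend_vm_family(WorkloadProfile(r["kind"], r["data_size_gb"], r["cpu"], r["mem"])),
    axis=1,
)
print(history[["job", "vm_used", "recommended"]])
```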

Coffee77
Contributor III

It looks interesting and I'll take a deeper look! At first sight, as a suggestion, I would include a new decision node that conditionally selects VMs that support "delta cache acceleration", now called "disk caching". These VMs have local SSD volumes, so they are very efficient at accessing and caching Parquet files from Delta tables at scale.

The disk cache (formerly known as "Delta cache") stores copies of remote data on the local disks (for example, SSD) of the virtual machines. The disk cache automatically detects when data files are created or deleted and updates its contents accordingly. The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when configuring your cluster. Such workers are enabled and configured for disk caching.
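
For workloads that land on such a worker type, the cache can also be enabled explicitly from a notebook or job. `spark.databricks.io.cache.enabled` is a documented Spark configuration on Databricks; whether you need to set it at all, versus relying on the worker-type default Coffee77 describes, depends on the instance family:

```python
# Enable the disk cache (formerly "Delta cache") explicitly on the cluster.
# On SSD-backed worker types it is typically on by default; setting it here
# just makes the intent visible in the job or notebook itself.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Verify the current setting
print(spark.conf.get("spark.databricks.io.cache.enabled"))
```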


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData
