topic Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads in Data Engineering

Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

saicharandeepb — Fri, 07 Nov 2025 11:37:16 GMT

Hi everyone!

I recently designed a decision tree model to help recommend the most suitable VM types for different kinds of workloads in Databricks.

Thought Process Behind the Design:
Determining the optimal virtual machine (VM) for a workload is heavily dependent on:

The type of operations being performed (compute-heavy, memory-intensive, or storage-heavy)
The size of the data being handled
And of course, cost considerations

Based on this flow, users can take a trial-and-error approach, monitoring Spark metrics to validate whether the current VM type or worker configuration is optimal.
If metrics indicate CPU, memory, or disk bottlenecks, the VM size or type can be adjusted to better suit the workload.

Moreover, if Spark metrics show that both CPU and memory utilization stay consistently below 50%, switching to general-purpose compute VMs is recommended to reduce cost and avoid over-provisioning.
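
As a rough illustration of the metric check described above, here is a minimal Python sketch. The 50% low-utilization cutoff comes from the post. Everything else is an assumption made for illustration: the 85% bottleneck threshold, the mapping from a bottlenecked resource to a VM family, and the names `UtilizationSamples` and `suggest_adjustment`. Utilization values are fractions between 0 and 1, collected however you already gather Spark or cluster metrics.

```python
from dataclasses import dataclass
from typing import Sequence

LOW_UTIL = 0.50    # from the post: CPU and memory both consistently below 50%
BOTTLENECK = 0.85  # assumption: the post does not give a bottleneck threshold


@dataclass
class UtilizationSamples:
    """Utilization fractions (0.0 to 1.0) sampled over a monitoring window."""
    cpu: Sequence[float]
    memory: Sequence[float]
    disk: Sequence[float]


def _mean(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("at least one sample is required")
    return sum(values) / len(values)


def suggest_adjustment(samples: UtilizationSamples) -> str:
    """Return a suggested VM family, or 'keep-current' if nothing stands out."""
    # Assumed mapping from the most loaded resource to a VM family.
    averages = {
        "compute-optimized": _mean(samples.cpu),
        "memory-optimized": _mean(samples.memory),
        "storage-optimized": _mean(samples.disk),
    }
    family, worst = max(averages.items(), key=lambda item: item[1])

    # Bottlenecks are checked first so that a saturated disk is not hidden
    # by low CPU and memory numbers.
    if worst >= BOTTLENECK:
        return family

    # "Consistently below 50%" is read here as: every sample is below 50%.
    cpu_low = all(value < LOW_UTIL for value in samples.cpu)
    memory_low = all(value < LOW_UTIL for value in samples.memory)
    if cpu_low and memory_low:
        return "general-purpose"

    return "keep-current"
```

Checking bottlenecks before the low-utilization rule is a design choice in this sketch, not something the flow prescribes; swap the order if cost reduction should win.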

I’d love feedback from the community on:

How can this decision tree be further evolved or refined?
What would be the best way to incorporate recommendations for general-purpose VMs directly into this flow?

Your insights will help make this decision tree more dynamic and practical for real-world Databricks workloads!

Thanks in advance for your thoughts and suggestions!

Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

ManojkMohan — Fri, 07 Nov 2025 12:50:00 GMT

@saicharandeepb

I would add a Cost vs. Performance prioritization node, a branch that captures mixed workloads, and decision branches that suggest switching to general-purpose VMs when utilization metrics consistently stay low.

If you can share an editable version of your decision tree, I'll try color-coding the delta. It seems like a fun learning exercise.

Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

saicharandeepb — Fri, 07 Nov 2025 13:26:08 GMT

Hi @ManojkMohan
I’m sharing the editable version of my decision tree as requested — please feel free to make your color-coded enhancements and share it back once you’re done. I’d love to see your take on it! 😊
Here’s the link to the editable decision-tree file:
Editable Decision Tree

Looking forward to your updated version.

Best regards,
Charan

Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

jameswood32 — Fri, 07 Nov 2025 13:31:49 GMT

Your decision tree idea sounds solid! To improve it, consider including additional factors like network bandwidth, storage IOPS, and workload burst patterns. Also, think about cost-performance trade-offs and potential scaling requirements. Validating the tree with historical workload data or small pilot deployments can help fine-tune recommendations. Finally, keep it flexible so you can update VM options as cloud providers release new instance types.

Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

ManojkMohan — Fri, 07 Nov 2025 18:27:43 GMT

Improvement areas are marked in green.

The updated process starts with a clear separation of workload types: compute-heavy, memory-intensive, storage-heavy, and mixed/other.

Instead of recommending generic VM types, the new tree differentiates recommendations by whether the data size is above or below a defined threshold.

A new node asks "Is high IOPS needed?" If yes, storage-optimized VMs are recommended; if no, general-purpose VMs are preferred.

Cost/performance is added as a priority decision node with three branches: cost-minimizing, performance-maximizing, or balanced.
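
For anyone who prefers reading the nodes as code, here is a hedged Python sketch of how they could fit together. The four workload types, the data-size threshold, the IOPS question, and the three priority branches are the ones listed above. The wiring between nodes, the 500 GB threshold, the size labels, and all names (`Workload`, `Priority`, `Recommendation`, `recommend`) are assumptions for illustration only, not the actual tree.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    COMPUTE_HEAVY = "compute-heavy"
    MEMORY_INTENSIVE = "memory-intensive"
    STORAGE_HEAVY = "storage-heavy"
    MIXED = "mixed/other"


class Priority(Enum):
    COST = "cost-minimizing"
    PERFORMANCE = "performance-maximizing"
    BALANCED = "balanced"


@dataclass(frozen=True)
class Recommendation:
    family: str  # VM family, e.g. "general-purpose"
    size: str    # hypothetical size label, not a real instance name


SIZE_THRESHOLD_GB = 500  # hypothetical value for the "defined threshold"


def recommend(workload: Workload, data_size_gb: float,
              needs_high_iops: bool, priority: Priority) -> Recommendation:
    # Node 1: the workload type picks the VM family.
    if workload is Workload.COMPUTE_HEAVY:
        family = "compute-optimized"
    elif workload is Workload.MEMORY_INTENSIVE:
        family = "memory-optimized"
    else:
        # Assumption: the IOPS question applies to storage-heavy and mixed workloads.
        family = "storage-optimized" if needs_high_iops else "general-purpose"

    # Node 2: is the data size above or below the threshold?
    large = data_size_gb > SIZE_THRESHOLD_GB

    # Node 3: cost/performance priority picks the size within the family.
    if priority is Priority.COST:
        size = "medium" if large else "small"
    elif priority is Priority.PERFORMANCE:
        size = "xlarge" if large else "large"
    else:
        size = "large" if large else "medium"

    return Recommendation(family=family, size=size)
```

For example, `recommend(Workload.MIXED, 120, False, Priority.COST)` lands on a small general-purpose worker under these assumptions.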

Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads

Coffee77 — Sun, 09 Nov 2025 18:34:40 GMT

It looks interesting and I'll take a deeper look! At first sight, as a suggestion, I would add a new decision node that conditionally recommends VMs that support "delta cache acceleration", now called "disk caching". These VMs have local SSD volumes, so they are very efficient at reading and caching large numbers of Parquet files from Delta tables.

The disk cache (formerly known as "Delta cache") stores copies of remote data on the local disks (for example, SSD) of the virtual machines. The disk cache automatically detects when data files are created or deleted and updates its contents accordingly. The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when configuring your cluster. Such workers are enabled and configured for disk caching.
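
A small Python sketch of what such a decision node might look like. The function name `disk_cache_node`, its inputs, and the returned action labels are hypothetical, and treating "repeated reads of the same Parquet/Delta data" as the trigger is my assumption about when caching pays off. `spark.databricks.io.cache.enabled` is the Databricks Spark setting for the disk cache; it is set explicitly here only to make the choice visible, since SSD-backed workers are described above as already enabled and configured for it.

```python
from typing import Dict


def disk_cache_node(rereads_parquet_or_delta: bool,
                    worker_has_local_ssd: bool) -> Dict[str, object]:
    """Decide whether the tree should steer toward a disk-cache-ready worker."""
    if not rereads_parquet_or_delta:
        # Assumption: without repeated reads, caching adds little value.
        return {"action": "skip", "spark_conf": {}}

    if worker_has_local_ssd:
        # The worker already has local SSD volumes, so the disk cache can be used.
        return {
            "action": "use-disk-cache",
            "spark_conf": {"spark.databricks.io.cache.enabled": "true"},
        }

    # Repeated reads but no local SSD: suggest a worker type with SSD volumes.
    return {"action": "switch-to-ssd-worker", "spark_conf": {}}
```

The returned `spark_conf` dictionary could be merged into whatever cluster configuration the rest of the tree produces.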