Friday
Hi everyone!
I recently designed a decision tree model to help recommend the most suitable VM types for different kinds of workloads in Databricks.
Thought Process Behind the Design:
Determining the optimal virtual machine (VM) for a workload depends heavily on the workload's resource profile: whether it is CPU-bound, memory-bound, or storage/I/O-bound.
Based on this flow, users can take a trial-and-error approach while monitoring Spark metrics to validate whether the current VM type or worker configuration is optimal.
If metrics indicate CPU, memory, or disk bottlenecks, the VM size or type can be adjusted to better suit the workload.
Moreover, if Spark metrics show that both CPU and memory utilization stay consistently below 50%, switching to general-purpose compute VMs is recommended to reduce cost and avoid over-provisioning.
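To make that loop concrete, here is a minimal Python sketch of the metric-based check described above. The 50% threshold is the one from this post; everything else (the function name, the disk-bottleneck flag, the exact family labels) is illustrative and not part of the decision tree itself.

```python
# Minimal sketch of the trial-and-error loop described above, assuming you have
# already pulled average utilization figures from the Spark UI or cluster
# metrics. Thresholds and names are illustrative placeholders.

def suggest_vm_family(avg_cpu_pct: float, avg_mem_pct: float, disk_bottleneck: bool) -> str:
    """Return a rough VM-family suggestion from observed utilization."""
    if disk_bottleneck:
        return "storage-optimized (local SSD / high IOPS)"
    if avg_cpu_pct < 50 and avg_mem_pct < 50:
        # Both resources under-used: general purpose avoids over-provisioning.
        return "general-purpose"
    if avg_mem_pct > avg_cpu_pct:
        return "memory-optimized"
    return "compute-optimized"

# Example: CPU-bound job with plenty of memory headroom.
print(suggest_vm_family(avg_cpu_pct=85, avg_mem_pct=40, disk_bottleneck=False))
# -> compute-optimized
```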
I’d love feedback from the community on this design.
Your insights will help make this decision tree more dynamic and practical for real-world Databricks workloads!
Thanks in advance for your thoughts and suggestions!
Friday - last edited Friday
I would add a Cost vs. Performance prioritization node, capture mixed workloads, and include decision branches that suggest switching to general-purpose VMs when utilization metrics consistently stay low.
If you can share an editable version of your decision tree, I will try color-coding the delta; it seems like a fun learning exercise.
Friday
Hi @ManojkMohan
I’m sharing the editable version of my decision tree as requested — please feel free to make your color-coded enhancements and share it back once you’re done. I’d love to see your take on it! 😊
Here’s the link to the editable decision-tree file:
Editable Decision Tree
Looking forward to your updated version.
Best regards,
Charan
Friday - last edited Friday
Improvement areas are marked in green:
The updated process starts with a clear separation of workload types: compute-heavy, memory-intensive, storage-heavy, and mixed/other.
Instead of generic VM types, the new tree differentiates recommendations by whether the data size is above or below a defined threshold.
It then asks "Is high IOPS needed?" If yes, storage-optimized VMs are recommended; if no, general-purpose VMs are preferred.
Finally, cost/performance is added as a priority decision node with three branches: cost-minimizing, performance-maximizing, or balanced (a rough code sketch of this branching logic follows below).
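To make the branching easier to review, here is a rough Python sketch of the updated logic. The 1 TB data-size threshold, the family labels, and the priority handling are placeholders I picked for illustration; the actual values live in the decision-tree file.

```python
# Hedged sketch of the updated branching logic described above.
# Workload labels, the data-size threshold, and the priority handling
# are illustrative placeholders, not the values in the decision-tree file.

def recommend_vm(workload: str, data_size_gb: float, high_iops: bool, priority: str,
                 size_threshold_gb: float = 1000) -> str:
    """workload: 'compute', 'memory', 'storage', or 'mixed';
    priority: 'cost', 'performance', or 'balanced'."""
    large = data_size_gb >= size_threshold_gb

    if workload == "compute":
        base = "compute-optimized (larger size)" if large else "compute-optimized"
    elif workload == "memory":
        base = "memory-optimized (larger size)" if large else "memory-optimized"
    elif workload == "storage":
        base = "storage-optimized" if high_iops else "general-purpose"
    else:  # mixed / other
        base = "general-purpose"

    # Cost vs. performance priority node with three branches.
    if priority == "cost":
        return f"{base}, smallest size that meets SLAs"
    if priority == "performance":
        return f"{base}, scale up or add workers before tuning for cost"
    return f"{base}, balanced sizing"

print(recommend_vm("storage", data_size_gb=250, high_iops=True, priority="cost"))
```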
Friday
Your decision tree idea sounds solid! To improve it, consider including additional factors like network bandwidth, storage IOPS, and workload burst patterns. Also, think about cost-performance trade-offs and potential scaling requirements. Validating the tree with historical workload data or small pilot deployments can help fine-tune recommendations. Finally, keep it flexible so you can update VM options as cloud providers release new instance types.
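On the validation point, a quick way to sanity-check the tree is to replay it over past runs and see how often it agrees with the VM choices that actually worked. A toy sketch, assuming you keep a table of historical jobs with their average utilization and the VM family used (all column names and figures below are made up for illustration):

```python
import pandas as pd

# Toy rule standing in for the decision tree (swap in the real tree logic).
def tree_recommendation(row):
    if row.avg_cpu_pct < 50 and row.avg_mem_pct < 50:
        return "general-purpose"
    return "memory-optimized" if row.avg_mem_pct > row.avg_cpu_pct else "compute-optimized"

# Hypothetical history of past jobs and the VM family that was actually used.
history = pd.DataFrame({
    "job":         ["etl_daily", "ml_train", "adhoc_sql"],
    "avg_cpu_pct": [85, 45, 30],
    "avg_mem_pct": [40, 80, 35],
    "vm_used":     ["compute-optimized", "memory-optimized", "general-purpose"],
})

history["vm_recommended"] = history.apply(tree_recommendation, axis=1)
agreement = (history["vm_recommended"] == history["vm_used"]).mean()
print(f"Tree matched historical choices on {agreement:.0%} of jobs")
```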
Sunday
It looks interesting and I'll take a deeper look! At first sight, as a suggestion, I would include a new decision node to conditionally recommend VMs that support "delta cache acceleration", now called "disk caching". These VMs have local SSD volumes, so they are very efficient when accessing and caching Parquet files from Delta tables at scale.
The disk cache (formerly known as "Delta cache") stores copies of remote data on the local disks (for example, SSD) of the virtual machines. The disk cache automatically detects when data files are created or deleted and updates its contents accordingly. The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when configuring your cluster. Such workers are enabled and configured for disk caching.
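If it helps, enabling the disk cache explicitly from a notebook looks roughly like this. This assumes a Databricks notebook where `spark` is already defined; the `spark.databricks.io.cache.*` settings are the documented knobs as far as I recall, and the sizes below are placeholders to tune per worker type (on SSD-backed workers the cache is usually on by default).

```python
# Assumes a Databricks notebook, where `spark` is predefined.
# Explicitly enable the disk cache (formerly Delta cache).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optional sizing knobs; the values here are placeholders, not recommendations.
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")

# Quick check of the current setting.
print(spark.conf.get("spark.databricks.io.cache.enabled"))
```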