<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138118#M50865</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;I’m sharing the &lt;STRONG&gt;editable version&lt;/STRONG&gt; of my decision tree as requested — please feel free to make your color-coded enhancements and share it back once you’re done. I’d love to see your take on it! &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;BR /&gt;Here’s the link to the editable decision-tree file:&lt;BR /&gt;&lt;A href="https://tinyurl.com/yycywcmr" target="_self"&gt;Editable Decision Tree&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Looking forward to your updated version.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Best regards,&lt;/STRONG&gt;&lt;BR /&gt;Charan&lt;/P&gt;</description>
    <pubDate>Fri, 07 Nov 2025 13:26:08 GMT</pubDate>
    <dc:creator>saicharandeepb</dc:creator>
    <dc:date>2025-11-07T13:26:08Z</dc:date>
    <item>
      <title>Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138101#M50863</link>
      <description>&lt;P&gt;Hi everyone!&lt;/P&gt;&lt;P&gt;I recently designed a &lt;STRONG&gt;decision tree model&lt;/STRONG&gt; to help recommend the most suitable &lt;STRONG&gt;VM types&lt;/STRONG&gt; for different kinds of &lt;STRONG&gt;workloads&lt;/STRONG&gt; in Databricks.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="saicharandeepb_0-1762515348166.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21440i6001DEA1A97201F1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="saicharandeepb_0-1762515348166.png" alt="saicharandeepb_0-1762515348166.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Thought Process Behind the Design:&lt;/STRONG&gt;&lt;BR /&gt;The &lt;STRONG&gt;optimal virtual machine (VM)&lt;/STRONG&gt; for a workload depends heavily on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The &lt;STRONG&gt;type of operations&lt;/STRONG&gt; being performed (compute-heavy, memory-intensive, or storage-heavy)&lt;/LI&gt;&lt;LI&gt;The &lt;STRONG&gt;size of the data&lt;/STRONG&gt; being handled&lt;/LI&gt;&lt;LI&gt;And, of course, &lt;STRONG&gt;cost considerations&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Based on this flow, users can take a &lt;STRONG&gt;trial-and-error approach&lt;/STRONG&gt; while monitoring &lt;STRONG&gt;Spark metrics&lt;/STRONG&gt; to validate whether the current VM type or worker configuration is optimal.&lt;BR /&gt;If metrics indicate CPU, memory, or disk bottlenecks, the &lt;STRONG&gt;VM size or type&lt;/STRONG&gt; can be adjusted to better suit the workload.&lt;/P&gt;&lt;P&gt;Moreover, if Spark metrics show that both &lt;STRONG&gt;CPU and memory utilization stay consistently below 50%&lt;/STRONG&gt;, switching to &lt;STRONG&gt;general-purpose compute VMs&lt;/STRONG&gt; is recommended to reduce cost and avoid over-provisioning.&lt;/P&gt;&lt;P&gt;I’d love feedback from the community on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;How can this decision tree be &lt;STRONG&gt;further evolved or refined&lt;/STRONG&gt;?&lt;/LI&gt;&lt;LI&gt;What would be the best way to &lt;STRONG&gt;incorporate recommendations for general-purpose VMs&lt;/STRONG&gt; directly into this flow?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Your insights will help make this decision tree more dynamic and practical for real-world Databricks workloads!&lt;/P&gt;&lt;P&gt;Thanks in advance for your thoughts and suggestions!&lt;/P&gt;</description>
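      <!--
      A minimal sketch of the right-sizing rule described in the post above: when both CPU and memory utilization stay below 50%, suggest general-purpose VMs. The function name, the input format (utilization samples as percentages), and reading "consistently" as "every sample" are assumptions made for illustration; they are not part of the original decision tree.

```python
def should_switch_to_general_purpose(cpu_samples, mem_samples, threshold=50.0):
    """Return True when every CPU and memory sample is below the threshold.

    cpu_samples / mem_samples: utilization percentages collected over time
    (hypothetical input format). threshold defaults to the 50% from the post.
    """
    # Without samples there is no evidence of over-provisioning.
    if not cpu_samples or not mem_samples:
        return False
    # "Consistently below" is interpreted here as: the peak stays under the threshold.
    return max(cpu_samples) < threshold and max(mem_samples) < threshold


if __name__ == "__main__":
    print(should_switch_to_general_purpose([20.0, 35.5, 41.0], [30.0, 44.0, 12.5]))
```

      A stricter or looser reading (for example a percentile instead of the peak) would only change the final comparison.
      -->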
      <pubDate>Fri, 07 Nov 2025 11:37:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138101#M50863</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-11-07T11:37:16Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138109#M50864</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would add a Cost vs. Performance prioritization step, a branch that captures mixed workloads, and decision branches that suggest switching to general-purpose VMs when utilization metrics stay consistently low.&lt;/P&gt;&lt;P&gt;If you can share an editable version of your decision tree, I will try color-coding the delta. It seems like a fun learning exercise.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_1-1762519508436.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21442i9DD99944E7A20AE9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_1-1762519508436.png" alt="ManojkMohan_1-1762519508436.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 12:50:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138109#M50864</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-11-07T12:50:00Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138118#M50865</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;I’m sharing the &lt;STRONG&gt;editable version&lt;/STRONG&gt; of my decision tree as requested — please feel free to make your color-coded enhancements and share it back once you’re done. I’d love to see your take on it! &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;BR /&gt;Here’s the link to the editable decision-tree file:&lt;BR /&gt;&lt;A href="https://tinyurl.com/yycywcmr" target="_self"&gt;Editable Decision Tree&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Looking forward to your updated version.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Best regards,&lt;/STRONG&gt;&lt;BR /&gt;Charan&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 13:26:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138118#M50865</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-11-07T13:26:08Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138120#M50866</link>
      <description>&lt;P&gt;Your decision tree idea sounds solid! To improve it, consider including additional factors like network bandwidth, storage IOPS, and workload burst patterns. Also, think about cost-performance trade-offs and potential scaling requirements. Validating the tree with historical workload data or small pilot deployments can help fine-tune recommendations. Finally, keep it flexible so you can update VM options as cloud providers release new instance types.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 13:31:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138120#M50866</guid>
      <dc:creator>jameswood32</dc:creator>
      <dc:date>2025-11-07T13:31:49Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138171#M50880</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_1-1762540018653.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21454i55FC4F354A8414F7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_1-1762540018653.png" alt="ManojkMohan_1-1762540018653.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Improvement areas are marked in green.&lt;BR /&gt;&lt;BR /&gt;The updated process starts with a clear separation of workload types: compute-heavy, memory-intensive, storage-heavy, and mixed/other.&lt;/P&gt;&lt;P&gt;Instead of generic VM types, the new tree differentiates recommendations by whether the data size is above or below a defined threshold.&lt;/P&gt;&lt;P&gt;A new decision node asks "Is high IOPS needed?"&lt;BR /&gt;If yes, storage-optimized VMs are recommended; if no, general-purpose VMs are preferred.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Cost/performance is added as a priority decision node with three branches: cost-minimizing, performance-maximizing, or balanced.&lt;/SPAN&gt;&lt;/P&gt;</description>
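      <!--
      A rough sketch of how the branches listed in the post above could be expressed in code. Only the workload categories and the "Is high IOPS needed?" question come from the post. The function name, the family labels for compute-heavy and memory-intensive workloads, and attaching the IOPS question to the storage-heavy branch are assumptions, since the tree itself is only available as an image. The data-size threshold and the cost/performance priority node are left out because the post gives no concrete values for them.

```python
def recommend_vm_family(workload_type, high_iops_needed=False):
    """Map a workload category to a VM family label (illustrative only)."""
    if workload_type == "compute-heavy":
        return "compute-optimized"  # assumed mapping
    if workload_type == "memory-intensive":
        return "memory-optimized"  # assumed mapping
    if workload_type == "storage-heavy":
        # From the post: high IOPS needed leads to storage-optimized,
        # otherwise general-purpose.
        return "storage-optimized" if high_iops_needed else "general-purpose"
    # Mixed/other workloads fall back to general-purpose in this sketch.
    return "general-purpose"


if __name__ == "__main__":
    print(recommend_vm_family("storage-heavy", high_iops_needed=True))
```
      -->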
      <pubDate>Fri, 07 Nov 2025 18:27:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138171#M50880</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-11-07T18:27:43Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Suggestions: Designed a Decision Tree to Recommend Optimal VM Types for Workloads</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138309#M50911</link>
      <description>&lt;P&gt;It looks interesting and I'll take a deeper look! At first sight, as a suggestion, &lt;STRONG&gt;I would include a new decision node to conditionally include VMs that support "Delta cache acceleration", now called "disk caching"&lt;/STRONG&gt;. These VMs have local &lt;EM&gt;SSD volumes&lt;/EM&gt;, so they are very efficient at reading and caching Parquet files from Delta tables at scale.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;The&amp;nbsp;&lt;A href="https://docs.azure.cn/en-us/databricks/optimizations/disk-cache" target="_blank" rel="noopener"&gt;disk cache&lt;/A&gt;&amp;nbsp;(formerly known as "Delta cache") stores &lt;STRONG&gt;copies of remote data on the local disks (for example, SSD) of the virtual machines&lt;/STRONG&gt;. The disk cache automatically detects when data files are created or deleted and updates its contents accordingly.&lt;STRONG&gt; The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes&lt;/STRONG&gt; when configuring your cluster. Such workers are enabled and configured for disk caching.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Nov 2025 18:34:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-suggestions-designed-a-decision-tree-to-recommend/m-p/138309#M50911</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-09T18:34:40Z</dc:date>
    </item>
  </channel>
</rss>

