<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pandas. in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119579#M3372</link>
    <description>&lt;P&gt;Hi team, I am working with a Databricks Asset Bundle (DAB) setup and have added my codebase repo to a workspace. My question is: do we need multiple worker nodes, i.e. &lt;STRONG&gt;num_workers&lt;/STRONG&gt; &amp;gt; 1 or &lt;STRONG&gt;autoscale&lt;/STRONG&gt; with a range of workers, if my codebase is mostly &lt;U&gt;pandas&lt;/U&gt; and parallelizes work with &lt;U&gt;joblib&lt;/U&gt; &lt;U&gt;Parallel&lt;/U&gt;? There is no PySpark in it.&lt;/P&gt;&lt;P&gt;Does it make sense to go with multiple nodes, or am I just paying for idle nodes? My current cluster definition is:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;targets:
  dev_cluster: &amp;amp;dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50&lt;/LI-CODE&gt;</description>
    <pubDate>Mon, 19 May 2025 06:52:19 GMT</pubDate>
    <dc:creator>harishgehlot</dc:creator>
    <dc:date>2025-05-19T06:52:19Z</dc:date>
    <item>
      <title>Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pandas.</title>
      <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119579#M3372</link>
      <description>&lt;P&gt;Hi team, I am working with a Databricks Asset Bundle (DAB) setup and have added my codebase repo to a workspace. My question is: do we need multiple worker nodes, i.e. &lt;STRONG&gt;num_workers&lt;/STRONG&gt; &amp;gt; 1 or &lt;STRONG&gt;autoscale&lt;/STRONG&gt; with a range of workers, if my codebase is mostly &lt;U&gt;pandas&lt;/U&gt; and parallelizes work with &lt;U&gt;joblib&lt;/U&gt; &lt;U&gt;Parallel&lt;/U&gt;? There is no PySpark in it.&lt;/P&gt;&lt;P&gt;Does it make sense to go with multiple nodes, or am I just paying for idle nodes? My current cluster definition is:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;targets:
  dev_cluster: &amp;amp;dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 19 May 2025 06:52:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119579#M3372</guid>
      <dc:creator>harishgehlot</dc:creator>
      <dc:date>2025-05-19T06:52:19Z</dc:date>
    </item>
    <item>
      <title>Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand</title>
      <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119660#M3377</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164953"&gt;@harishgehlot&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You are right that extra workers are not worth it if your code is mostly pandas. Pandas runs on the driver node, so nothing is distributed to the workers the way Spark tasks would be, and the workers would sit idle. The same should hold for joblib Parallel with its default local backends, which spread work across the cores of the machine the code runs on. I would opt for a sufficiently large driver so the job performs well and does not run into out-of-memory errors.&lt;/P&gt;</description>
      <pubDate>Mon, 19 May 2025 22:12:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119660#M3377</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-05-19T22:12:38Z</dc:date>
    </item>
    <item>
      <title>Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand</title>
      <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119683#M3379</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154481"&gt;@Shua42&lt;/a&gt;&amp;nbsp;for your response. I hope we can discuss this a bit more here. Since pandas doesn't support distributed computation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;We should not opt for 3 - 10 worker nodes; it should be just one, right?&lt;/LI&gt;&lt;LI&gt;Suppose the job runs for many hours. Spot instances are not advisable because the cloud provider can terminate them, right? We should opt for on-demand instances, right?&lt;/LI&gt;&lt;LI&gt;If possible, can you suggest a workflow configuration for the setup I have described here? This is what the workflow template currently has:&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;first_on_demand: 1  # in the workflow template&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 20 May 2025 04:08:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119683#M3379</guid>
      <dc:creator>harishgehlot</dc:creator>
      <dc:date>2025-05-20T04:08:07Z</dc:date>
    </item>
    <item>
      <title>Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand</title>
      <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119767#M3380</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164953"&gt;@harishgehlot&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;- Right, you can even opt for a single-node cluster, since you don't need any workers if you're only using pandas to process the data.&lt;/P&gt;
&lt;P&gt;- Yes, you're right that on-demand instances are preferable for long-running tasks because of the termination risk with spot instances, especially if your code isn't fault tolerant.&lt;/P&gt;
&lt;P&gt;- I'm not sure of all the configurations you'd need based on your code and tasks, but you can add&amp;nbsp;availability: ON_DEMAND under aws_attributes to ensure it's not using spot instances.&lt;/P&gt;
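&lt;P&gt;As a rough sketch of how the points above could look in the bundle's cluster definition (not a tested config): the fragment below reuses the values from the original post, drops the autoscale block in favour of num_workers: 0, and sets availability: ON_DEMAND. The spark_conf entries and the ResourceClass tag are the settings Databricks documents for single-node clusters; they do not come from this thread, so please verify them, and the node size, against the current documentation and your workload. Keep your existing docker_image, instance profile and logging settings as they are.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Sketch only: single-node job cluster for driver-only pandas/joblib code.
new_cluster:
  spark_version: 14.3.x-scala2.12
  node_type_id: m5d.16xlarge          # from the original post; size it for your data
  data_security_mode: SINGLE_USER
  num_workers: 0                      # no workers; replaces the autoscale block
  spark_conf:
    # Assumption: documented single-node settings, not taken from this thread.
    spark.databricks.cluster.profile: singleNode
    spark.master: "local[*]"
  custom_tags:
    clusterSource: forecasting
    ResourceClass: SingleNode         # assumption: tag used for single-node clusters
  aws_attributes:
    availability: ON_DEMAND           # avoid spot reclamation on long-running jobs
    ebs_volume_type: GENERAL_PURPOSE_SSD
    ebs_volume_count: 1
    ebs_volume_size: 50&lt;/LI-CODE&gt;
&lt;P&gt;With no workers, joblib can only use the cores of that one machine, so the node type is the main lever for both speed and memory.&lt;/P&gt;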
</description>
      <pubDate>Tue, 20 May 2025 14:21:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119767#M3380</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-05-20T14:21:10Z</dc:date>
    </item>
    <item>
      <title>Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand</title>
      <link>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119788#M3384</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154481"&gt;@Shua42&lt;/a&gt;&amp;nbsp;. You really helped me a lot.&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 16:13:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/shall-we-opt-for-multiple-worker-nodes-in-dab-workflow-template/m-p/119788#M3384</guid>
      <dc:creator>harishgehlot</dc:creator>
      <dc:date>2025-05-20T16:13:06Z</dc:date>
    </item>
  </channel>
</rss>

