a month ago
Hi team, I am working with a Databricks Asset Bundle architecture and have added my codebase repo to a workspace. My question: do we need multiple worker nodes (num_workers > 1), or autoscaling over a range of workers, if my codebase is mostly pandas and parallelizes with joblib's Parallel? There is no PySpark integration.
Does it make sense to go with multiple nodes, or am I just wasting money on idle workers?
targets:
  dev_cluster: &dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50
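For context, the joblib pattern in question looks roughly like the sketch below (the function and data are illustrative, not from the actual codebase). The key point is that joblib's default backend spawns worker processes on the driver node only, so remote worker nodes never receive any of this work:

```python
from joblib import Parallel, delayed

def process_partition(values):
    # Placeholder for the per-chunk pandas work; illustrative only.
    return sum(values)

# joblib forks/spawns processes on the driver node itself, so only the
# driver's CPU cores are used -- any Spark worker nodes sit idle.
chunks = [[1, 2], [3, 4], [5, 6]]
results = Parallel(n_jobs=2)(delayed(process_partition)(c) for c in chunks)
print(results)  # [3, 7, 11]
```

This is why the cluster's autoscaled workers would only add cost here, not throughput.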
a month ago
Thanks @Shua42 for your response. I hope we can discuss this more here as well, since pandas doesn't support distributed computation.
first_on_demand: 1 (In workflow template)
a month ago
Hi @harishgehlot ,
- Right, you can even opt for single node since you don't need any workers if you're only running Pandas to process the data.
- Yes, you're right that on-demand instances are preferable for long-running tasks because of the termination risk with spot instances, especially if your code isn't fault tolerant.
- I'm not sure of all the configurations you'd need based on your code and tasks, but you can set availability: ON_DEMAND under aws_attributes to ensure it's not using spot instances.
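Putting those two suggestions together, a single-node, on-demand new_cluster block could look roughly like this. This is a sketch, not a drop-in replacement: the node type is carried over from the original config, and the singleNode spark_conf/tag combination follows the standard Databricks single-node cluster settings, which should be verified against the current docs:

```yaml
new_cluster:
  spark_version: 14.3.x-scala2.12
  node_type_id: m5d.16xlarge   # size the driver for your pandas workload
  num_workers: 0               # single node: driver only, no workers
  spark_conf:
    spark.databricks.cluster.profile: singleNode
    spark.master: "local[*]"
  custom_tags:
    ResourceClass: SingleNode
  aws_attributes:
    first_on_demand: 1
    availability: ON_DEMAND    # avoid spot termination for long jobs
```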
a month ago
Hey @harishgehlot ,
You are right in that it is not worth it to use workers if your code is mostly Pandas. Pandas runs primarily on the driver node, so no workers are needed as nothing is being distributed to the workers as they would be with Spark. I would just opt for a sufficiently large driver to make sure it performs well and so you don't run into out-of-memory errors.
a month ago
Thanks @Shua42 . You really helped me a lot.