Administration & Architecture

Should we opt for multiple worker nodes in a DAB workflow template if our codebase is based on pandas?

harishgehlot
New Contributor III

Hi team, I am working with a Databricks Asset Bundle architecture and have added my codebase repo to a workspace. My question: do we need to opt for multiple worker nodes (num_workers > 1) or autoscaling over a range of workers if my codebase is mostly pandas and parallelizes with joblib's Parallel? There is no PySpark integration.

Does it make sense to go with multiple nodes, or am I just wasting money on idle nodes?

 

targets:
  dev_cluster: &dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50

4 REPLIES

Shua42
Databricks Employee

Hey @harishgehlot ,

You are right that it isn't worth using workers if your code is mostly pandas. Pandas runs entirely on the driver node, so nothing is distributed to the workers the way it would be with Spark, and any workers would simply sit idle. I would opt for a single, sufficiently large driver so the job performs well and you don't run into out-of-memory errors; see the sketch below.
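
For reference, a single-node job cluster in the bundle template could look roughly like the following. This is only a sketch, not a drop-in replacement for your config: the node type is a placeholder, you would still carry over your docker_image, cluster_log_conf, and aws_attributes, and the spark_conf and ResourceClass tag shown are the settings Databricks uses to mark a cluster as single-node.

  dev_cluster: &dev_cluster
    new_cluster:
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.4xlarge   # placeholder; size the driver for your pandas workload
      num_workers: 0              # no workers; pandas + joblib run on the driver only
      spark_conf:
        spark.databricks.cluster.profile: singleNode
        spark.master: "local[*]"
      custom_tags:
        ResourceClass: SingleNode
      data_security_mode: SINGLE_USER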

harishgehlot
New Contributor III

Thanks @Shua42 for your response. I hope we can discuss a bit more here as well. Since pandas doesn't support distributed computation:

  • We should not opt for 3-10 worker nodes; the default single node should be enough, right?
  • Suppose the job runs for very long hours. A spot instance is not advisable, as it can be terminated by the cloud provider at any time, right? So we should opt for an on-demand instance instead?
  • If possible, can you suggest some workflow code for my case, e.g. where first_on_demand: 1 fits in the workflow template?

 

Shua42
Databricks Employee

Hi @harishgehlot ,

- Right, you can even opt for a single-node cluster, since you don't need any workers if you're only using pandas to process the data.

- Yes, you're right that on-demand instances are preferable for long-running tasks because of the termination risk with spot instances, especially if your code isn't fault tolerant.

- I'm not sure of all the configurations you'd need based on your code and tasks, but you can add availability: ON_DEMAND under aws_attributes to ensure the cluster isn't using spot instances, as sketched below.
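
For example, the aws_attributes block of your new_cluster could look like this (a sketch based on your posted config; only the availability line is new):

      aws_attributes:
        availability: ON_DEMAND      # all nodes on-demand, never spot
        first_on_demand: 1           # only relevant when spot instances are in play
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50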

 

harishgehlot
New Contributor III

Thanks @Shua42. You really helped me a lot.
