topic Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pandas. in Administration & Architecture

harishgehlot — Mon, 19 May 2025 06:52:19 GMT

Hi team, I am working with a Databricks Asset Bundle (DAB) architecture and have added my codebase repo to a workspace. My question: do we need multiple worker nodes (num_worker_nodes > 1, or autoscale with a range of worker nodes) if my codebase mostly uses pandas and parallelizes with joblib Parallel? There is no PySpark integration.

Does it make sense to go with multiple nodes, or am I just wasting money on idle nodes?

targets:
  dev_cluster: &dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50

Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand

Shua42 — Mon, 19 May 2025 22:12:38 GMT

Hey @harishgehlot ,

You are right that it is not worth using workers if your code is mostly pandas. pandas runs on the driver node, so no workers are needed: nothing is distributed to the workers as it would be with Spark. I would just opt for a sufficiently large driver so that it performs well and you don't run into out-of-memory errors.
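As a rough sketch of what a driver-only cluster could look like in the bundle: this assumes the standard Databricks single-node cluster settings (`num_workers: 0` together with the `singleNode` profile, `spark.master` and the `ResourceClass` tag) and reuses the runtime and node type from the question. The key name `single_node_cluster` is hypothetical, and the right instance size depends on how much memory your pandas DataFrames and joblib workers need.

```yaml
# Sketch only: a single-node (driver-only) job cluster.
# "single_node_cluster" is an illustrative name, not from the original post.
single_node_cluster: &single_node_cluster
  new_cluster:
    spark_version: 14.3.x-scala2.12
    node_type_id: m5d.16xlarge      # size the driver for pandas memory needs
    num_workers: 0                  # no workers; replaces the autoscale block
    spark_conf:
      spark.databricks.cluster.profile: singleNode
      spark.master: "local[*]"
    custom_tags:
      ResourceClass: SingleNode     # tag used for single-node clusters
      clusterSource: forecasting
    data_security_mode: SINGLE_USER
```

With this shape, the `autoscale` block from the question is dropped entirely, so no idle workers are billed.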

Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand

harishgehlot — Tue, 20 May 2025 04:08:07 GMT

Thanks @Shua42 for your response. I hope we can discuss this a bit more here, given that pandas doesn't support distributed computation.

- So we should not opt for 3 - 10 worker nodes; it should be just one by default, right?
- Suppose the job runs for many hours. Spot instances are not advisable because the cloud provider can terminate them, right? Should we opt for on-demand instances instead?
- If possible, can you suggest some workflow configuration for the needs I've described here?

first_on_demand: 1 (In workflow template)

Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand

Shua42 — Tue, 20 May 2025 14:21:10 GMT

Hi @harishgehlot ,

- Right, you can even opt for single node since you don't need any workers if you're only running Pandas to process the data.

- Yes, you're right that on-demand instances are preferable for long-running tasks because of the termination risk with spot instances, especially if your code isn't fault tolerant.

- I'm not sure of all the configurations you'd need based on your code and tasks, but you can add availability: ON_DEMAND under aws_attributes to ensure it's not using spot instances.
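For illustration, the on-demand setting could sit in the cluster definition roughly as follows. This is a sketch under assumptions rather than a tested configuration: the EBS values are copied from the question, and the redacted instance profile is omitted here and would stay as in your own template.

```yaml
# Sketch only: force on-demand capacity for the cluster's instances.
aws_attributes:
  availability: ON_DEMAND           # other values: SPOT, SPOT_WITH_FALLBACK
  first_on_demand: 1                # already in the template; keeps the driver on-demand
  ebs_volume_type: GENERAL_PURPOSE_SSD
  ebs_volume_count: 1
  ebs_volume_size: 50
```

On a single-node cluster there is only the driver, so `first_on_demand: 1` already covers it; setting `availability: ON_DEMAND` makes the intent explicit and also protects any workers you might add later.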

Re: Shall we opt for multiple worker nodes in dab workflow template if our codebase is based on pand

harishgehlot — Tue, 20 May 2025 16:13:06 GMT

Thanks @Shua42 . You really helped me a lot.