a month ago
Hi team, I am working with a Databricks Asset Bundle architecture and have added my codebase repo to a workspace. My question: do we need multiple worker nodes (num_workers > 1), or autoscaling over a range of workers, if my codebase is mostly pandas and parallelizes with joblib's Parallel? There is no PySpark integration.
Does it make sense to go with multiple nodes, or am I just wasting money on idle workers?
targets:
  dev_cluster: &dev_cluster
    new_cluster:
      cluster_log_conf:
        dbfs:
          destination: "dbfs:/FileStore/logs"
      spark_version: 14.3.x-scala2.12
      node_type_id: m5d.16xlarge
      custom_tags:
        clusterSource: forecasting
      data_security_mode: SINGLE_USER
      autotermination_minutes: 20
      autoscale:
        min_workers: 3
        max_workers: 20
      docker_image:
        url: "**************"
      aws_attributes:
        first_on_demand: 1
        instance_profile_arn: **************
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 50
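For context, the joblib pattern in question looks roughly like the sketch below (the function and data are illustrative, not from the actual codebase). The key point is that joblib's default backend spawns worker processes on the driver node only, so remote worker nodes never receive any of this work:

```python
from joblib import Parallel, delayed

def process_partition(values):
    # Placeholder for the per-chunk pandas work; illustrative only.
    return sum(values)

# joblib forks/spawns processes on the driver node itself, so only the
# driver's CPU cores are used -- any Spark worker nodes sit idle.
chunks = [[1, 2], [3, 4], [5, 6]]
results = Parallel(n_jobs=2)(delayed(process_partition)(c) for c in chunks)
print(results)  # [3, 7, 11]
```

This is why the cluster's autoscaled workers would only add cost here, not throughput.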
a month ago
Thanks @Shua42 for your response. I hope we can discuss this more here as well, since pandas doesn't support distributed computation.
first_on_demand: 1 (In workflow template)
a month ago
Hi @harishgehlot ,
- Right, you can even opt for single node since you don't need any workers if you're only running Pandas to process the data.
- Yes, you're right that on-demand instances are preferable for long-running tasks because of the termination risk with spot instances, especially if your code isn't fault tolerant.
- I'm not sure of all the configurations you'd need based on your code and tasks, but you can set availability: ON_DEMAND under aws_attributes to ensure it's not using spot instances.
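Putting those two suggestions together, a single-node, on-demand new_cluster block could look roughly like this. This is a sketch, not a drop-in replacement: the node type is carried over from the original config, and the singleNode spark_conf/tag combination follows the standard Databricks single-node cluster settings, which should be verified against the current docs:

```yaml
new_cluster:
  spark_version: 14.3.x-scala2.12
  node_type_id: m5d.16xlarge   # size the driver for your pandas workload
  num_workers: 0               # single node: driver only, no workers
  spark_conf:
    spark.databricks.cluster.profile: singleNode
    spark.master: "local[*]"
  custom_tags:
    ResourceClass: SingleNode
  aws_attributes:
    first_on_demand: 1
    availability: ON_DEMAND    # avoid spot termination for long jobs
```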
a month ago
Hey @harishgehlot ,
You are right in that it is not worth it to use workers if your code is mostly Pandas. Pandas runs primarily on the driver node, so no workers are needed as nothing is being distributed to the workers as they would be with Spark. I would just opt for a sufficiently large driver to make sure it performs well and so you don't run into out-of-memory errors.
a month ago
Thanks @Shua42 . You really helped me a lot.