From what I know, Spark automatically handles how data and workload are distributed across worker nodes during distributed training; you can't manually control exactly what or how much data goes to a specific node. You can still influence the distribution to some extent with techniques like repartition, partitionBy, or a custom partitioner. These control how the data is split across partitions, but not which worker node ends up processing each partition. Spark's scheduler still decides that part behind the scenes.
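To make the distinction concrete, here's a minimal sketch of the kind of custom partition function you can pass to PySpark's `rdd.partitionBy(numPartitions, partitionFunc)`. The `"EU"` key and the 4-partition layout are made-up examples. Note that the function only picks a partition index; Spark still decides which executor processes each partition.

```python
# Sketch of a custom partition function for PySpark's
# rdd.partitionBy(numPartitions, partitionFunc).
# PySpark applies partitionFunc(key) % numPartitions, so the function
# just needs to return an int per key.
NUM_PARTITIONS = 4  # example value

def region_partitioner(key):
    # Illustration: give a hypothetical "hot" key its own partition 0,
    # and spread all other keys across partitions 1..3.
    # Caveat: Python's built-in hash() is randomized per process
    # (PYTHONHASHSEED), which is why Spark itself uses portable_hash
    # by default; a real partitioner should use a stable hash.
    if key == "EU":
        return 0
    return 1 + (hash(key) % (NUM_PARTITIONS - 1))
```

In a Spark job you'd use it on a pair RDD, e.g. `pairs.partitionBy(NUM_PARTITIONS, region_partitioner)`; this fixes which partition each key lands in, but node placement remains up to the scheduler.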