From what I know, Spark automatically handles how data and workload are distributed across worker nodes during distributed training; you can't manually control exactly what or how much data goes to a specific node. You can still influence the distribution to some extent with techniques like repartition, partitionBy, or a custom partitioner. These control how the data is split across partitions, but not which worker node ends up processing each partition. Spark's scheduler still decides that part behind the scenes.
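To make the distinction concrete, here's a minimal sketch of the kind of custom partition function you can pass to PySpark's `rdd.partitionBy(numPartitions, partitionFunc)`. The `"EU"` key and the 4-partition layout are made-up examples. Note that the function only picks a partition index; Spark still decides which executor processes each partition.

```python
# Sketch of a custom partition function for PySpark's
# rdd.partitionBy(numPartitions, partitionFunc).
# PySpark applies partitionFunc(key) % numPartitions, so the function
# just needs to return an int per key.
NUM_PARTITIONS = 4  # example value

def region_partitioner(key):
    # Illustration: give a hypothetical "hot" key its own partition 0,
    # and spread all other keys across partitions 1..3.
    # Caveat: Python's built-in hash() is randomized per process
    # (PYTHONHASHSEED), which is why Spark itself uses portable_hash
    # by default; a real partitioner should use a stable hash.
    if key == "EU":
        return 0
    return 1 + (hash(key) % (NUM_PARTITIONS - 1))
```

In a Spark job you'd use it on a pair RDD, e.g. `pairs.partitionBy(NUM_PARTITIONS, region_partitioner)`; this fixes which partition each key lands in, but node placement remains up to the scheduler.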