Re: How do I distribute machine learning process i...

shan_chandra · ‎01-16-2024

@mohaimen_syed - There are a couple of likely reasons why at most 2 nodes are being used.

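1. The sklearn implementation of RandomForestClassifier is not distributed; it trains on a single machine no matter how many nodes the cluster has. Please use the pyspark.ml implementation (pyspark.ml.classification.RandomForestClassifier) instead.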

2. Your DataFrame may be small enough that Spark splits it into only a few partitions, so only a couple of nodes receive any work.

Always start with a small number of nodes and adjust the count based on your workload.