12-19-2023 11:39 AM
Hi,
I'm trying to train and test my model on a Spark DataFrame with around 5 numerical features and 3.5 million rows. My cluster has 60 nodes available but is only using 2. How can I distribute the process or make it more efficient and faster?
My code:
01-11-2024 01:02 PM
@mohaimen_syed - can you please try the pyspark.ml implementation of RandomForestClassifier instead of sklearn and see if it works? Below is an example - https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_exa...
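For reference, here is a rough sketch along the lines of that example, assuming a Spark DataFrame df with five numeric feature columns and a label column (all column names below are placeholders):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Combine the numeric feature columns into a single vector column.
# "f1" ... "f5" and "label" are placeholder column names.
assembler = VectorAssembler(
    inputCols=["f1", "f2", "f3", "f4", "f5"],
    outputCol="features",
)

# Spark's distributed random forest; training runs across the executors.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

pipeline = Pipeline(stages=[assembler, rf])

# df is assumed to be an existing Spark DataFrame with the columns above.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
```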
Thanks, Shan
01-11-2024 01:55 PM
Thank you for your reply @shan_chandra. I looked at this code and tried doing the same thing. The cluster uses 2 nodes at most, even though there are 60 available. I believe the advantage of using Databricks is its distributed compute, but I'm not sure how to use it effectively.
01-16-2024 07:43 AM
@mohaimen_syed - There are several reasons why only 2 nodes are being used at most.
1. The sklearn implementation of RandomForestClassifier is not distributed. Please use the pyspark.ml implementation.
2. Your DataFrame may be small enough that Spark does not need more nodes (see the partition check sketched below).
Always start with a small number of nodes and adjust the node count based on your workload.
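One way to check whether the data is spread across enough partitions to keep more executors busy is something like the following hedged sketch (df and the partition count are placeholders, not values from your job):

```python
# See how many partitions the DataFrame currently has; if this number is
# small, only a few executors will receive tasks.
print(df.rdd.getNumPartitions())

# Repartition so work can be spread across more executors. 200 is only an
# illustrative value; roughly 2-4 partitions per available core is a common
# starting point.
df = df.repartition(200)
```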
01-16-2024 11:32 AM
I have tried using pyspark.ml, and I used the link you sent to mimic the process. The data I'm using is fairly large and the job takes over 30 minutes to run. I have not written any code to change the number of nodes. I want to learn how to use more than two nodes to improve performance so I can add more features.