Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
How do I distribute a machine learning process on my Spark DataFrame?

New Contributor III


I'm trying to use around 5 numerical features on 3.5 million rows to train and test my model with a Spark DataFrame. My cluster has 60 nodes available but is only using 2. How can I distribute the process or make it more efficient and faster?

My code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Random Forest Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)

# Pipeline
pipeline = Pipeline(stages=[vector_assembler, rf])

# Hyperparameter tuning using Cross-Validation
param_grid = ParamGridBuilder().build()
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Train the model
cv_model = cross_validator.fit(df)

# Make predictions
predictions = cv_model.transform(df)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")

Esteemed Contributor

@mohaimen_syed - can you please try using Spark MLlib's implementation of RandomForestClassifier instead of sklearn and see if it works? Below is an example -

Thanks, Shan

Thank you for your reply @shan_chandra. I looked at this code and tried doing the same thing. The cluster uses 2 nodes at most, even though there are 60 available. I believe the advantage of using Databricks is its distributed compute, but I'm not sure how to use it effectively.

@mohaimen_syed - There are many reasons why only 2 nodes are used at the most.

1. The sklearn implementation of the random forest classifier is not distributed. Please use the Spark MLlib implementation.

2. Your DataFrame may be small enough that it occupies only a few partitions, so only a few executors receive tasks.

Always start with a small number of nodes and scale the node count based on your workload.

I have tried using it, and I used the link you sent me to mimic the process. The data I'm using is pretty large and takes over 30 minutes to run. I have not written any code to change the number of nodes. I want to learn how to use more than two nodes to improve performance so I can add more features.
