cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How do I distribute machine learning process in my spark data frame

mohaimen_syed
New Contributor III

Hi,

I'm trying to use around 5 numerical features on 3.5 million rows to train and test my model with a spark data frame.My cluster has 60 nodes available but is only using 2. How can I distribute the process or make it for efficient and faster.

My code:

vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Random Forest Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)

# Pipeline
pipeline = Pipeline(stages=[vector_assembler, rf])

# Hyperparameter tuning using Cross-Validation
param_grid = ParamGridBuilder().build() 
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Train the model
cv_model = cross_validator.fit(df)

# Make predictions
predictions = cv_model.transform(df)


# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")
4 REPLIES 4

shan_chandra
Databricks Employee
Databricks Employee

@mohaimen_syed  - can you please try using pyspark.ml implementation of randomForestClassifier instead of sklearn and see if it works. Below is an example - https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_exa...

Thanks, Shan

Thank you for your reply @shan_chandra . I looked at this code and tried doing the same thing. The cluster uses 2 nodes at most, even though there's 60 available. I believe the advantage of using Databricks is to use the distributed compute method, but I'm not sure how to effectively use it.

@mohaimen_syed - There are many reasons why only 2 nodes are used at the most. 

1. sklearn implementation of randomforest classifier is not distributed. Please use pyspark.ml implementation

2. your dataframe may be small enough.

Always start with a small number of nodes and modify the number of nodes based on your workload. 

I have tried using pyspark.ml, and I used the link you sent me to mimic the process. The data I'm using is pretty large and takes over 30 mins to run. I have not written any code to update the nodes. I want to learn how to use more than two nodes to increase the performance so I can add more features.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group