How do I distribute machine learning process in my spark data frame

mohaimen_syed — Tue, 19 Dec 2023 19:39:39 GMT

Hi,

I'm trying to use around 5 numerical features on 3.5 million rows to train and test my model with a Spark DataFrame. My cluster has 60 nodes available but is only using 2. How can I distribute the process or make it more efficient and faster?

My code:

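from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assemble the numerical feature columns into a single vector column
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Random Forest Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)

# Pipeline
pipeline = Pipeline(stages=[vector_assembler, rf])

# Hyperparameter tuning using Cross-Validation (the grid is empty, so only the defaults are evaluated)
param_grid = ParamGridBuilder().build()
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Train the model
cv_model = cross_validator.fit(df)

# Make predictions (on the same data used for training)
predictions = cv_model.transform(df)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")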

Re: How do I distribute machine learning process in my spark data frame

shan_chandra — Thu, 11 Jan 2024 21:02:16 GMT

@mohaimen_syed - Can you please try using the pyspark.ml implementation of RandomForestClassifier instead of sklearn and see if it works? Below is an example: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/random_forest_classifier_example.py

Thanks, Shan

Re: How do I distribute machine learning process in my spark data frame

mohaimen_syed — Thu, 11 Jan 2024 21:55:10 GMT

Thank you for your reply, @shan_chandra. I looked at this code and tried doing the same thing. The cluster uses 2 nodes at most, even though there are 60 available. I believe the advantage of using Databricks is distributed compute, but I'm not sure how to use it effectively.

Re: How do I distribute machine learning process in my spark data frame

shan_chandra — Tue, 16 Jan 2024 15:43:27 GMT

@mohaimen_syed - There are a few possible reasons why at most 2 nodes are used:

1. The sklearn implementation of the random forest classifier is not distributed. Please use the pyspark.ml implementation.

2. Your DataFrame may be small enough that Spark splits it into only a few partitions, and each partition is processed by a single task.

Always start with a small number of nodes and adjust the number of nodes based on your workload.
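The second point can be checked directly. The following is a minimal sketch, not a confirmed fix for this thread: it assumes `df` is a pyspark.sql.DataFrame and that the total number of executor cores is known. The helper names `suggested_partitions` and `spread_across_cluster` are made up for illustration, and the default of 3 tasks per core is only a rule of thumb.

```python
# Sketch only: `df` is assumed to be a pyspark.sql.DataFrame and `total_cores`
# the number of executor cores in the cluster. The helper names and the
# tasks_per_core=3 default are illustrative assumptions, not Spark APIs.

def suggested_partitions(total_cores: int, tasks_per_core: int = 3) -> int:
    """Rule-of-thumb partition count: a few tasks per available core."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def spread_across_cluster(df, total_cores: int, tasks_per_core: int = 3):
    """Repartition df when it has fewer partitions than the suggested count."""
    current = df.rdd.getNumPartitions()  # how many partitions Spark sees now
    target = suggested_partitions(total_cores, tasks_per_core)
    if current >= target:
        return df  # already at or above the suggested partition count
    return df.repartition(target)  # full shuffle into `target` partitions
```

In a notebook this might be called as `df = spread_across_cluster(df, spark.sparkContext.defaultParallelism)` before fitting. If `df.rdd.getNumPartitions()` already returns a large number, partitioning is probably not the bottleneck and the cause lies elsewhere.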

Re: How do I distribute machine learning process in my spark data frame

mohaimen_syed — Tue, 16 Jan 2024 19:32:13 GMT

I have tried using pyspark.ml, and I used the link you sent me to mimic the process. The data I'm using is pretty large, and the job takes over 30 minutes to run. I have not written any code to change the number of nodes. I want to learn how to use more than two nodes to improve performance so that I can add more features.
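One setting that may be worth trying is the `parallelism` parameter of pyspark.ml's CrossValidator, which defaults to 1, so the fold models are fitted one after another. The sketch below assumes the `pipeline`, `param_grid`, and `evaluator` objects from the original post. The helper names `count_model_fits` and `build_cross_validator` are hypothetical, and `parallelism=4` is an arbitrary starting value rather than a recommendation.

```python
def count_model_fits(num_param_maps: int, num_folds: int) -> int:
    """Fits run by CrossValidator.fit: one per (param map, fold), plus a final refit."""
    return num_param_maps * num_folds + 1


def build_cross_validator(pipeline, param_maps, evaluator, num_folds=5, parallelism=4):
    """Return a CrossValidator that fits up to `parallelism` models concurrently."""
    # Imported here so count_model_fits stays usable without pyspark installed.
    from pyspark.ml.tuning import CrossValidator

    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=param_maps,
        evaluator=evaluator,
        numFolds=num_folds,
        parallelism=parallelism,  # default is 1: models are fitted sequentially
    )
```

An empty `ParamGridBuilder().build()` yields a single parameter map, so with 5 folds `count_model_fits(1, 5)` gives 6 fits of the 100-tree forest. Whether running them concurrently spreads work over more nodes depends on how the data is partitioned and how the cluster is configured, so treat this as something to experiment with rather than a guaranteed speedup.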