Hi,
I'm trying to train and test a model on about 5 numerical features across 3.5 million rows, using a Spark DataFrame. My cluster has 60 nodes available, but only 2 of them are being used. How can I distribute the work or make it more efficient and faster?
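
One thing I suspect is the input partitioning, since each partition is processed by a single task; here's a quick sketch of what I'm considering (the target of 240 partitions is my own guess, roughly 4 per node, not something from my current setup):

# Check how many partitions the DataFrame currently has
print(df.rdd.getNumPartitions())

# Spread the rows over more partitions so more executors get work
# (240 is an assumed target, roughly 4 partitions per node)
df = df.repartition(240)
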
My code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# Assemble the numerical feature columns into a single vector column
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
# Random Forest Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="target", numTrees=100)
# Pipeline
pipeline = Pipeline(stages=[vector_assembler, rf])
# Cross-validation (note: this grid is empty, so no hyperparameters are
# actually searched; CV just refits the default pipeline on each fold)
param_grid = ParamGridBuilder().build()
evaluator = BinaryClassificationEvaluator(labelCol="target", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
cross_validator = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
# Train the model
cv_model = cross_validator.fit(df)
# Make predictions (note: this predicts on the same df used for training; there is no held-out test set)
predictions = cv_model.transform(df)
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")