KNN classifier on Spark

Muthu145
New Contributor

Hi Team,

Can you please help me implement a KNN classifier in PySpark that trains on and processes the dataset in a distributed fashion?

I also want to validate the KNN model against a test dataset.

I tried scikit-learn, but the program runs locally. I want the classifier to be distributed while the model is being trained.

At the end, I want to validate the classifier on the test dataset and calculate its accuracy.

raela
Databricks Employee

Refer to the programming guide to see the algorithms available in MLlib:

http://spark.apache.org/docs/latest/ml-classification-regression.html

There is no KNN in MLlib, so you might want to try another algorithm that is available.

rlgarris
Databricks Employee

Hi - KNN is notoriously hard to parallelize in Spark because KNN is a "lazy learner": there is no separate training step, and the model itself is the entire dataset. Most single-machine implementations rely on KD trees or ball trees to store the entire dataset in the RAM of a single machine. If you really want to use KNN, I would recommend using scikit-learn's single-machine implementation with a simple random sample of the dataset.
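To make the "model is the entire dataset" point concrete, here is a minimal sketch in plain Python, not the scikit-learn or Spark API. The toy data, the names `knn_predict` and `accuracy`, and the sample size of 100 are all assumptions made up for illustration. It draws a simple random sample of the training rows, keeps that sample as the "model", and computes accuracy on a held-out test set.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=3):
    """Brute-force KNN: 'train' is a list of (features, label) pairs.
    Nothing is fitted ahead of time; every prediction scans all of 'train'."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)

# Hypothetical toy data: two well-separated 2-D clusters labelled "a" and "b".
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(500)]
data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "b") for _ in range(500)]
rng.shuffle(data)

# Hold out 200 rows for testing; the rest is the full training set.
test, full_train = data[:200], data[200:]

# Simple random sample (without replacement) that stands in for the full data.
sample = rng.sample(full_train, 100)

acc = accuracy(sample, test, k=3)
print(f"accuracy on held-out set: {acc:.3f}")
```

In practice you would likely swap the hand-written `knn_predict` for scikit-learn's `KNeighborsClassifier`, which uses the tree structures mentioned above; the sampling and held-out evaluation steps stay the same.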

SouravSaha
New Contributor II

Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for this?

Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_dem...

It runs on a distributed (MPI-based) framework and can run on any system.