
KNN classifier on Spark

Muthu145
New Contributor

Hi Team,

Can you please help me implement a KNN classifier in PySpark so that both the dataset processing and the model run on a distributed architecture?

I tried scikit-learn, but the program runs locally on a single machine. I want the classifier to be distributed while the model is trained.

At the end, I want to validate the KNN model against a test dataset and calculate its accuracy.

3 REPLIES

raela
New Contributor III

Refer to the programming guide to see the algorithms available in MLlib:

http://spark.apache.org/docs/latest/ml-classification-regression.html

There is no KNN in MLlib, so you might want to try another algorithm that is available.

User16826991422
Contributor

Hi - KNN is notoriously hard to parallelize in Spark because KNN is a "lazy learner": the model itself is the entire dataset. Most single-machine implementations rely on KD-trees or ball trees and keep the entire dataset in the RAM of one machine. If you really want to use KNN, I would recommend scikit-learn's single-machine implementation on a simple random sample of the dataset.
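
To make the "lazy learner" point concrete, here is a minimal sketch in plain Python (standard library only). It is not scikit-learn or Spark code: the `BruteForceKNN` class, the `make_toy_data` helper, and the toy two-class data are all hypothetical and exist only for illustration. It shows that fitting just stores the data, that every prediction scans the whole stored set, and how one might train on a simple random sample and then measure accuracy on a held-out test set.

```python
import random
from collections import Counter
from math import dist


class BruteForceKNN:
    """Illustrative lazy learner: 'training' only stores the data."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # No parameters are learned; the model *is* the stored dataset.
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, point):
        # Every prediction compares the query against all stored points.
        nearest = sorted(zip(self.X, self.y),
                         key=lambda row: dist(row[0], point))[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return [self.predict_one(p) for p in X]


def accuracy(y_true, y_pred):
    # Fraction of test rows whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def make_toy_data(n, rng):
    # Hypothetical data: class 0 clusters near (0, 0), class 1 near (5, 5).
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 5.0 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


rng = random.Random(42)
data = make_toy_data(2000, rng)

# Hold out a test set, then draw a simple random sample of the rest to train on.
rng.shuffle(data)
test, pool = data[:400], data[400:]
train = rng.sample(pool, 200)

model = BruteForceKNN(k=5).fit([x for x, _ in train], [y for _, y in train])
y_pred = model.predict([x for x, _ in test])
acc = accuracy([y for _, y in test], y_pred)
print(f"accuracy on held-out test set: {acc:.3f}")
```

With scikit-learn, the same workflow would typically use `train_test_split`, `KNeighborsClassifier`, and `accuracy_score` in place of the hand-written pieces; the sample size of 200 here is an arbitrary choice for the toy data, not a recommendation.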

SouravSaha
New Contributor II

Hey, how about using the NEC Frovedis framework (https://github.com/frovedis/frovedis) for this?

Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_dem...

It works on a distributed framework (MPI based) and can run on any system.
