KNN classifier on Spark
12-19-2016 04:50 PM
Hi Team,
Can you please help me implement a KNN classifier in PySpark that uses the distributed architecture to process the dataset?
I also want to validate the KNN model against a testing dataset.
I tried scikit-learn, but the program runs locally. I want to distribute the classifier while training the model.
At the end, I want to validate the classifier with the testing dataset and calculate the accuracy.
Labels: Dataframes, Machine Learning, Scikit-learn, Spark
12-22-2016 09:51 AM
Refer to the programming guide to see the algorithms available in MLlib:
http://spark.apache.org/docs/latest/ml-classification-regression.html
There is no KNN in MLlib, so you might want to try another algorithm that is available.
12-27-2016 10:51 AM
Hi - KNN is notoriously hard to parallelize in Spark because it is a "lazy learner": the model itself is the entire dataset. Most single-machine implementations rely on KD trees or ball trees to hold the entire dataset in the RAM of a single machine. If you really want to use KNN, I would recommend scikit-learn's single-machine implementation on a simple random sample of the dataset.
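A rough sketch of that suggestion, in case it helps: draw a simple random sample, fit scikit-learn's KNeighborsClassifier on it, and report accuracy on a held-out part of the sample. The helper name `knn_on_sample`, the 10% sample fraction, k=5, the 75/25 split and the synthetic data from `make_classification` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_on_sample(X, y, sample_fraction=0.1, n_neighbors=5, seed=0):
    """Hypothetical helper: KNN on a simple random sample, returns test accuracy."""
    rng = np.random.default_rng(seed)

    # Simple random sample without replacement. The sample must be large
    # enough that the training part holds at least n_neighbors rows.
    n_sample = int(len(X) * sample_fraction)
    idx = rng.choice(len(X), size=n_sample, replace=False)
    X_s, y_s = X[idx], y[idx]

    # Keep 25% of the sample aside for validation (the ratio is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X_s, y_s, test_size=0.25, random_state=seed
    )

    # algorithm="kd_tree" asks for the KD tree index mentioned above;
    # "ball_tree" or the default "auto" are also valid choices.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm="kd_tree")
    knn.fit(X_train, y_train)
    return accuracy_score(y_test, knn.predict(X_test))


if __name__ == "__main__":
    # Synthetic stand-in for the real dataset.
    X, y = make_classification(n_samples=20000, n_features=10, random_state=0)
    print("accuracy on held-out sample:", knn_on_sample(X, y))
```

If the full dataset lives in Spark, one way to get the sample onto a single machine is `df.sample(fraction=...)` followed by `toPandas()`, provided the sampled rows fit in the driver's memory.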
02-04-2020 06:31 PM
Hey, how about using the NEC Frovedis framework (https://github.com/frovedis/frovedis) for this?
It runs on a distributed, MPI-based framework and can run on any system.

