Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-27-2016 10:51 AM
Hi - KNN is notoriously hard to parallelize in Spark because KNN is a "lazy learner" and the model itself is the entire dataset. Most single machine implementations rely on KD Trees or Ball Trees to store the entire dataset in the RAM of a single machine. I would recommend using scikit-learn's single machine implementation with a Simple Random Sample of the dataset if you really want to use KNN.