rlgarris
Databricks Employee
Databricks Employee

Hi - KNN is notoriously hard to parallelize in Spark because KNN is a "lazy learner" and the model itself is the entire dataset. Most single machine implementations rely on KD Trees or Ball Trees to store the entire dataset in the RAM of a single machine. I would recommend using scikit-learn's single machine implementation with a Simple Random Sample of the dataset if you really want to use KNN.