Resolved! Unable to reproduce Kmeans Clustering results even after setting seed and tolerance
Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet:

    from pyspark.ml.clustering import KMeans
    KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=.00001)

Can anyone help?
Latest Reply
This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition from one run to the next. I was able to resolve this by making sure the same data is always assigned to the same partitions: df.repartition(num_partitions, "ur_col_id") d...
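
To make the reasoning concrete, here is a toy pure-Python sketch (not Spark, and not Spark's actual KMeans code) of why a fixed seed alone may not be enough. It assumes, purely for illustration, that initial candidate points are drawn with one random generator per partition, seeded from the global seed plus the partition index. Under that assumption the draw depends on which rows sit in which partition, so pinning rows to partitions by a stable id, which is the idea behind df.repartition(num_partitions, "ur_col_id"), makes the draw repeatable. All function names below are hypothetical.

```python
import random
import zlib


def partition_by_id(rows, num_partitions):
    """Assign each (row_id, value) pair to a partition via a stable hash of its id."""
    parts = [[] for _ in range(num_partitions)]
    # Sorting first fixes the order of rows inside each partition as well.
    for row_id, value in sorted(rows):
        idx = zlib.crc32(str(row_id).encode()) % num_partitions
        parts[idx].append((row_id, value))
    return parts


def partition_arbitrarily(rows, num_partitions, layout_seed):
    """Mimic an unpinned layout: which rows share a partition varies per run."""
    shuffled = list(rows)
    random.Random(layout_seed).shuffle(shuffled)
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


def sample_initial_points(parts, seed, per_partition=1):
    """Draw candidates with one RNG per partition (assumed seeding scheme)."""
    picked = []
    for idx, part in enumerate(parts):
        rng = random.Random(seed * 1000003 + idx)
        if part:
            picked.extend(rng.sample(part, min(per_partition, len(part))))
    return sorted(picked)


rows = [(i, i * 1.5) for i in range(200)]

# Pinned layout: same seed, same partitions, same draw, even if input order changes.
pinned_a = sample_initial_points(partition_by_id(rows, 4), seed=1)
pinned_b = sample_initial_points(partition_by_id(list(reversed(rows)), 4), seed=1)
print("pinned runs match:", pinned_a == pinned_b)

# Unpinned layout: same seed, but the rows in each partition differ between runs.
loose_a = sample_initial_points(partition_arbitrarily(rows, 4, layout_seed=10), seed=1)
loose_b = sample_initial_points(partition_arbitrarily(rows, 4, layout_seed=20), seed=1)
print("unpinned runs match:", loose_a == loose_b)
```

With the pinned layout the two runs pick the same points regardless of input order. With the unpinned layout the same seed will typically pick different points, which is the kind of run-to-run drift described in the question.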