Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet:
    from pyspark.ml.clustering import KMeans
    KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
Can anyone help?
This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition on every run. I was able to resolve this by making sure the same data is always assigned to the same partitions: df.repartition(num_partitions, "ur_col_id") d...
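To illustrate why a fixed seed alone may not be enough, here is a simplified pure-Python sketch. It is not Spark's actual initialization code. It only assumes that a seeded random choice of starting centers depends on the order in which rows are seen, which is the kind of effect a changing partition layout can cause. The `rows` data and the helper `seeded_initial_centers` are hypothetical names made up for this example.

```python
import random


def seeded_initial_centers(rows, k, seed):
    """Pick k starting centers with a seeded RNG (hypothetical helper).

    The RNG selects the same *positions* on every call, so the chosen
    rows depend on how the input happens to be ordered.
    """
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Hypothetical data: (id, feature) pairs with unique ids.
rows = [(i, float(i * i % 17)) for i in range(20)]

# The same data arriving in a different order, standing in for a
# different partition layout between two runs.
reordered = list(reversed(rows))

# Same seed, different row order: the starting centers differ.
run_a = seeded_initial_centers(rows, k=3, seed=1)
run_b = seeded_initial_centers(reordered, k=3, seed=1)
print(run_a == run_b)

# Fixing the layout first (here: sorting by id) makes the runs agree.
fixed_a = seeded_initial_centers(sorted(rows), k=3, seed=1)
fixed_b = seeded_initial_centers(sorted(reordered), k=3, seed=1)
print(fixed_a == fixed_b)
```

Sorting by id in this sketch plays the role that repartitioning on a stable id column plays above: both runs see the data in the same layout, so the seeded choice lands on the same rows.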
Hi Debaya, thanks for your reply; it runs without any issues. Each time I reran the model, I got different cluster outputs, even after applying the seed and tolerance mentioned in my code snippet. I would expect the results to be the same ...