Databricks Community

mala · 01-10-2023

Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet:

from pyspark.ml.clustering import KMeans
KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=.00001)

Can anyone help?

mala · 01-19-2023

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition. I was able to resolve this by making sure the same data is assigned to the same partitions: df.repartition(num_partitions, "ur_col_id") d...

mala · 01-11-2023

Hi Debaya, thanks for your reply. It runs without any issues. However, each time I rerun the model I get different cluster outputs, even after setting the seed and tolerance as shown in my code snippet. I would expect the results to be the same ...


Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

Re: Unable to reproduce Kmeans Clustering results even after setting seed and tolerance