Databricks Community

mala · ‎01-10-2023

Hi

I have been trying to reproduce KMeans results, with no luck.

Here is my code snippet:

from pyspark.ml.clustering import KMeans

# featuresCol and clusters are defined earlier in my code
kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=1e-5)

Can anyone help?

mala · ‎01-19-2023

This issue was caused by Spark's parallelization, which doesn't guarantee that the same rows are assigned to the same partition on every run.

I was able to resolve it by making sure the same data is always assigned to the same partitions, in the same order:

# repartition and sortWithinPartitions return new DataFrames, so keep the result
df = df.repartition(num_partitions, "ur_col_id").sortWithinPartitions("ur_col_id")
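For intuition on why a fixed seed alone may not be enough, here is a toy sketch in plain Python (no Spark involved). It assumes a simplified model in which each partition samples its rows with its own seeded random generator; this is an illustration of the idea, not Spark's actual KMeans implementation. The helper names (sample_rows, split_in_arrival_order, split_by_key) are hypothetical.

```python
import random


def sample_rows(partitions, seed, fraction=0.5):
    """Toy sampler: one RNG per partition, seeded from the seed and partition index."""
    picked = []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed * 1000 + index)
        # Each row consumes one random draw, so the outcome depends on
        # which rows sit in this partition and in what order.
        picked.extend(row for row in rows if rng.random() < fraction)
    return sorted(picked)


def split_in_arrival_order(rows, num_partitions):
    """Layout that depends on the order rows happen to arrive in."""
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def split_by_key(rows, num_partitions):
    """Layout that depends only on the row ids: assign by key, then sort."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row % num_partitions].append(row)
    return [sorted(p) for p in partitions]


run_a = list(range(20))   # row ids as they arrive in one run
run_b = run_a[::-1]       # same rows, different arrival order in another run

# Same seed, arrival-order layout: the two runs may sample different rows.
print(sample_rows(split_in_arrival_order(run_a, 2), seed=1))
print(sample_rows(split_in_arrival_order(run_b, 2), seed=1))

# Same seed, key-based layout: both runs sample exactly the same rows.
fixed_a = sample_rows(split_by_key(run_a, 2), seed=1)
fixed_b = sample_rows(split_by_key(run_b, 2), seed=1)
print(fixed_a == fixed_b)
```

In this toy, split_by_key plays the role of repartitioning on "ur_col_id" and sorting within partitions: the layout depends only on the data, so the same seed leads to the same draws on every run.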

Debayan · ‎01-11-2023

Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.

mala · ‎01-11-2023

Hi Debayan,

Thanks for your reply. The code runs without any errors, but each time I rerun the model I get different cluster outputs, even with the seed and tolerance set as shown in my code snippet.

I would expect the results to be identical once a seed is set, since that should remove any randomness. I also increased the number of iterations, which didn't help either.

Is there a way to reproduce the results in Spark?

Thanks

Mala