
Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

mala
New Contributor III

Hi

I have been trying to reproduce KMeans results, with no luck.

Here is my code snippet:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=1e-5)

Can anyone help?

1 ACCEPTED SOLUTION

Accepted Solutions

mala
New Contributor III

This issue was caused by Spark's parallelization, which doesn't guarantee that the same rows are assigned to the same partitions on every run.

I was able to resolve this by making sure the same data is assigned to the same partitions, in the same order, before fitting:

# Both calls return a new DataFrame, so chain them and keep the result.
df = df.repartition(num_partitions, "ur_col_id").sortWithinPartitions("ur_col_id")


3 REPLIES

Debayan
Esteemed Contributor III

Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.

mala
New Contributor III

Hi Debayan,

Thanks for your reply. The code runs without any errors, but each time I rerun the model I get different cluster outputs, even with the seed and tolerance set as shown in my code snippet.

I would expect the results to be the same once a seed is set, since that should remove any randomness. I also increased the number of iterations, which didn't help either.

Is there a way to reproduce the results in Spark?

Thanks

Mala
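
One way to picture why a fixed seed alone may not be enough: if a sampling step draws from a seeded generator per partition, the seed fixes the sequence of draws but not which rows those draws land on. The toy model below is plain Python and is not Spark's actual code; `pick_seeds` and the two layouts are made up for illustration.

```python
import random


def pick_seeds(partitions, seed):
    """Toy model: each partition draws one row with its own seeded generator."""
    picks = []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed + index)    # same seed and index: same draw every run
        position = rng.randrange(len(rows))  # the draw selects a position, not a row value
        picks.append(rows[position])
    return picks


rows = list(range(20))

# Run 1: rows 0-9 land in partition 0, rows 10-19 in partition 1.
layout_a = [rows[:10], rows[10:]]
# Run 2: the same rows per partition, but each partition is rotated by one position.
layout_b = [rows[1:10] + rows[:1], rows[11:] + rows[10:11]]

repeat_same_layout = pick_seeds(layout_a, seed=1) == pick_seeds(layout_a, seed=1)
across_layouts = pick_seeds(layout_a, seed=1) == pick_seeds(layout_b, seed=1)
print(repeat_same_layout, across_layouts)  # prints: True False
```

With an identical layout the picks repeat exactly. Once the rows sit at different positions, the same seed selects different rows, so the layout has to be fixed as well as the seed.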

