Databricks Community

mala · ‎01-10-2023

Hi

I have been trying to get reproducible KMeans results across runs, with no luck.

Here is my code snippet:

from pyspark.ml.clustering import KMeans

# featuresCol and clusters are variables defined elsewhere in my notebook
kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)

Can anyone help?

Debayan · ‎01-11-2023

Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.

mala · ‎01-11-2023

Hi Debayan

Thanks for your reply. The code runs without any errors. However, each time I rerun the model I get different cluster outputs, even though I set the seed and tolerance as shown in my code snippet.

I would expect the results to be identical once a seed is set, since that should remove the randomness. I also increased the number of iterations, which didn't help either.

Is there a way to reproduce the results in Spark?

Thanks

Mala

mala · ‎01-19-2023

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition from run to run.

I was able to resolve this by making sure the same data is always assigned to the same partitions, in the same order:

df = df.repartition(num_partitions, "ur_col_id")

df = df.sortWithinPartitions("ur_col_id")
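
To illustrate why a fixed seed alone may not be enough, here is a toy, pure-Python sketch (not Spark code). It assumes, as a simplification, that each partition draws its random sample from a generator derived from the shared seed and the partition index, so the rows that get picked depend on which rows sit in which partition and in what order. The function names and the seed derivation are hypothetical and only meant to mirror the idea behind repartition and sortWithinPartitions.

```python
import random


def arrival_order(rows, run):
    """Mimic rows arriving in a different, arbitrary order on each run."""
    shuffled = list(rows)
    random.Random(run).shuffle(shuffled)
    return shuffled


def arbitrary_layout(rows, num_partitions):
    """Split rows into partitions in whatever order they arrived."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def fixed_layout(rows, num_partitions):
    """Toy stand-in for repartitioning on a key and sorting within partitions."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row % num_partitions].append(row)  # the key decides the partition
    return [sorted(part) for part in parts]      # the order inside is fixed too


def sample_rows(partitions, seed, per_partition=2):
    """Sample from each partition with a generator derived from the shared
    seed and the partition index (hypothetical derivation, not Spark's)."""
    picked = []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed * 1_000_003 + index)
        picked.extend(rng.sample(rows, per_partition))
    return tuple(sorted(picked))


rows = list(range(100))  # stand-in for the values of "ur_col_id"
seed = 1
runs = range(5)

# Same seed, layout that changes between runs: collect the distinct samples.
varying = {sample_rows(arbitrary_layout(arrival_order(rows, r), 4), seed) for r in runs}

# Same seed, layout fixed by key and sort: collect the distinct samples.
stable = {sample_rows(fixed_layout(arrival_order(rows, r), 4), seed) for r in runs}

print("distinct samples with arbitrary layout:", len(varying))
print("distinct samples with fixed layout:", len(stable))
```

With the fixed layout, every run samples the same rows regardless of arrival order, which is the effect that repartitioning on a key column and sorting within partitions aims for before fitting the model.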

Unable to reproduce Kmeans Clustering results even after setting seed and tolerance
