Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to reproduce KMeans clustering results even after setting seed and tolerance

mala
New Contributor III

Hi

I have been trying to reproduce KMeans results, with no luck.

Here is my code snippet:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
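
Roughly, the rest of the run looks like this (a sketch; df stands for the input DataFrame with the assembled feature column, which isn't shown above):

model = kmeans.fit(df)              # df: input DataFrame (placeholder)
centers = model.clusterCenters()    # one centroid (a numpy array) per cluster
predictions = model.transform(df)   # adds a "prediction" column
# Comparing `centers` or the "prediction" column across reruns is where the
# run-to-run differences show up.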

Can anyone help?


3 REPLIES

Debayan
Databricks Employee

Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.

mala
New Contributor III

Hi Debayan,

Thanks for your reply. It runs without any issues, but I get different cluster outputs each time I rerun the model, even after applying the seed and tolerance as shown in my code snippet.

I would expect the results to be the same once a seed is applied, since that should remove any randomness. I also increased the number of iterations, which didn't help either.

Is there a way to reproduce the results in Spark?

Thanks

Mala

mala
New Contributor III

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition.

I was able to resolve it by making sure the same data is assigned to the same partitions:

df = df.repartition(num_partitions, "ur_col_id")   # hash-partition on a stable ID column
df = df.sortWithinPartitions("ur_col_id")          # fix the row order within each partition
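
Putting the two pieces together, a sketch of the full setup looks like this (using the same placeholder names as above: num_partitions, "ur_col_id", featuresCol, clusters, and df; not a tested configuration):

from pyspark.ml.clustering import KMeans

# Pin the data layout: same partition assignment and same row order on every run.
df_stable = (
    df.repartition(num_partitions, "ur_col_id")
      .sortWithinPartitions("ur_col_id")
)

# With a stable layout plus a fixed seed and tolerance, reruns should produce
# the same cluster assignments.
kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
model = kmeans.fit(df_stable)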
