Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to reproduce KMeans clustering results even after setting seed and tolerance

mala
New Contributor III

Hi

I have been trying to reproduce KMeans results, with no luck.

Here is my code snippet:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
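
Roughly, the rest of the run looks like this (a sketch; df stands for the input DataFrame with the assembled feature column, which isn't shown above):

model = kmeans.fit(df)              # df: input DataFrame (placeholder)
centers = model.clusterCenters()    # one centroid (a numpy array) per cluster
predictions = model.transform(df)   # adds a "prediction" column
# Comparing `centers` or the "prediction" column across reruns is where the
# run-to-run differences show up.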

Can anyone help?


3 REPLIES

Debayan
Databricks Employee

Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.

mala
New Contributor III

Hi Debayan,

Thanks for your reply. It runs without any issues, but I get different cluster outputs each time I rerun the model, even after applying the seed and tolerance as shown in my code snippet.

I would expect the results to be the same once a seed is applied, since that should remove any randomness. I also increased the number of iterations, which didn't help either.

Is there a way to reproduce the results in Spark?

Thanks

Mala

mala
New Contributor III

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition.

I was able to resolve it by making sure the same data is assigned to the same partitions:

df = df.repartition(num_partitions, "ur_col_id")   # hash-partition on a stable ID column
df = df.sortWithinPartitions("ur_col_id")          # fix the row order within each partition
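
Putting the two pieces together, a sketch of the full setup looks like this (using the same placeholder names as above: num_partitions, "ur_col_id", featuresCol, clusters, and df; not a tested configuration):

from pyspark.ml.clustering import KMeans

# Pin the data layout: same partition assignment and same row order on every run.
df_stable = (
    df.repartition(num_partitions, "ur_col_id")
      .sortWithinPartitions("ur_col_id")
)

# With a stable layout plus a fixed seed and tolerance, reruns should produce
# the same cluster assignments.
kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
model = kmeans.fit(df_stable)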
