01-10-2023 03:25 PM
Hi
I have been trying to get reproducible KMeans results, with no luck.
Here is my code snippet:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
Can anyone help?
01-19-2023 10:52 AM
This issue was caused by Spark's parallelization, which doesn't guarantee that the same rows are assigned to the same partitions on every run.
I was able to resolve it by making sure the same data always lands in the same partitions, in the same order:
# both calls return a new DataFrame, so keep the result
df = df.repartition(num_partitions, "ur_col_id")
df = df.sortWithinPartitions("ur_col_id")
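For anyone wondering why a fixed seed alone might not be enough, here is a toy, Spark-free sketch. It assumes, purely for illustration, that random picks are made per partition from a generator derived from the seed and the partition index. The helper names (round_robin, partition_rows, sample_candidates) are made up for this example and this is not Spark's actual code.

```python
import random
import zlib


def round_robin(rows, num_partitions):
    """Order-dependent layout: a row's partition depends on its arrival position."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


def partition_rows(rows, num_partitions):
    """Stable layout: hash the row id to pick a partition, then sort each partition by id."""
    partitions = [[] for _ in range(num_partitions)]
    for row_id, value in rows:
        idx = zlib.crc32(str(row_id).encode()) % num_partitions
        partitions[idx].append((row_id, value))
    return [sorted(part) for part in partitions]


def sample_candidates(partitions, seed, per_partition=2):
    """Pick rows from each partition with a generator derived from seed and partition index."""
    picked = []
    for idx, part in enumerate(partitions):
        rng = random.Random(seed * 1000 + idx)  # hypothetical per-partition seeding
        picked.extend(rng.sample(part, min(per_partition, len(part))))
    return sorted(picked)


rows = [(i, float(i * i % 17)) for i in range(40)]
shuffled = rows[:]
random.Random(123).shuffle(shuffled)  # same data, different arrival order

# Same seed, but the layout follows arrival order, so the picks can differ.
a = sample_candidates(round_robin(rows, 4), seed=1)
b = sample_candidates(round_robin(shuffled, 4), seed=1)
print("order-dependent layout, same picks:", a == b)

# Same seed and a layout fixed by the id column: the picks match.
c = sample_candidates(partition_rows(rows, 4), seed=1)
d = sample_candidates(partition_rows(shuffled, 4), seed=1)
print("stable layout, same picks:", c == d)
```

The stable layout plays the role of repartition on the id column followed by sortWithinPartitions: once each partition holds the same rows in the same order, the seeded picks no longer depend on how the data happened to arrive.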
01-11-2023 01:11 PM
Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.
01-11-2023 01:20 PM
Hi Debaya
Thanks for your reply. The code runs without any errors, but every time I rerun the model I get different cluster outputs, even with the seed and tolerance set as shown in my code snippet.
I would expect the results to be identical once a seed is set, since that should remove the randomness. I also increased the number of iterations, which didn't help either.
Is there a way to reproduce the results in Spark?
Thanks
Mala