01-10-2023 03:25 PM
Hi,
I have been trying to get reproducible KMeans results, with no luck.
Here is my code snippet:
from pyspark.ml.clustering import KMeans
KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol = .00001)
Can anyone help?
Labels: Kmeans, Spark Kmeans
Accepted Solutions
01-19-2023 10:52 AM
This issue was caused by Spark's parallelization, which does not guarantee that the same rows are assigned to the same partitions on every run. A fixed seed therefore does not by itself give identical results.
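To illustrate the point, here is a toy, pure-Python sketch (not Spark code). It assumes a simplified model in which each partition draws from its own random generator derived from the seed and the partition index; `seeded_picks` and `fixed_layout` are hypothetical helpers made up for this example, not Spark APIs.

```python
import random


def seeded_picks(partitions, seed):
    """Pick one row per partition, using an RNG derived from seed and partition index."""
    picks = []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed * 1000 + index)  # toy per-partition seeding
        picks.append(rng.choice(rows))
    return picks


def fixed_layout(rows, num_partitions):
    """Assign each row to a partition by its id and sort rows within each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row % num_partitions].append(row)
    return [sorted(part) for part in partitions]


# Same rows, same seed, but the rows landed in different partitions on each "run".
run_a = [[1, 2, 3], [4, 5, 6]]
run_b = [[4, 5, 6], [1, 2, 3]]
picks_a = seeded_picks(run_a, seed=1)
picks_b = seeded_picks(run_b, seed=1)

# With a fixed layout, the arrival order of the rows no longer matters.
stable_a = seeded_picks(fixed_layout([3, 1, 6, 4, 2, 5], 2), seed=1)
stable_b = seeded_picks(fixed_layout([5, 2, 4, 6, 1, 3], 2), seed=1)
```

In this toy model `picks_a` and `picks_b` differ even though the seed is the same, while `stable_a` and `stable_b` match because the layout was fixed before sampling.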
I was able to resolve this by making sure the same rows always land in the same partitions, in the same order:
# DataFrames are immutable, so keep the returned DataFrame
df = df.repartition(num_partitions, "ur_col_id")
df = df.sortWithinPartitions("ur_col_id")
01-11-2023 01:11 PM
Hi, do you receive any errors? Please refer to https://www.databricks.com/tensorflow/clustering-and-k-means for examples, and let us know if this helps.
01-11-2023 01:20 PM
Hi Debaya,
Thanks for your reply. The code runs without any errors, but each time I rerun the model I get different cluster outputs, even with the seed and tolerance set as in my code snippet.
I would expect the results to be identical once a seed is set, since that should remove the randomness. Increasing the number of iterations did not help either.
Is there a way to reproduce the results in Spark?
Thanks
Mala

