Data Engineering

Forum Posts

mala
by New Contributor III
  • 1674 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet:
from pyspark.ml.clustering import KMeans
KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=0.00001)
Can anyone help?

Latest Reply
mala
New Contributor III
  • 2 kudos

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition from one run to the next. I was able to resolve it by making sure the same data is always assigned to the same partitions: df.repartition(num_partitions, "ur_col_id")d...
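To illustrate why a fixed seed alone may not be enough, here is a minimal pure-Python sketch. It is a toy model, not Spark's actual implementation: it assumes that seeded sampling happens per partition, so the points picked depend on which rows land in which partition and in what order. The helper names `partition_rows` and `seeded_sample` are hypothetical. Assigning rows by a hash of a stable id column (the role `df.repartition(num_partitions, "ur_col_id")` plays in the reply) and then sorting inside each partition makes the picks independent of arrival order. In PySpark the sort step would likely correspond to `sortWithinPartitions`, though the reply does not confirm that.

```python
import random
import zlib


def partition_rows(rows, num_partitions, by_key):
    """Split (row_id, value) pairs into partitions.

    by_key=True  -> partition by a hash of row_id, then sort inside each
                    partition, so the layout ignores arrival order.
    by_key=False -> round-robin by arrival position, so the layout
                    changes whenever the input order changes.
    """
    parts = [[] for _ in range(num_partitions)]
    for pos, (row_id, value) in enumerate(rows):
        if by_key:
            # crc32 is stable across runs, unlike Python's salted hash().
            idx = zlib.crc32(str(row_id).encode()) % num_partitions
        else:
            idx = pos % num_partitions
        parts[idx].append((row_id, value))
    if by_key:
        parts = [sorted(p) for p in parts]
    return parts


def seeded_sample(parts, seed, per_partition=1):
    """Pick points from each partition with a per-partition seeded RNG
    (an assumed, simplified stand-in for seeded initialization)."""
    picked = []
    for idx, part in enumerate(parts):
        rng = random.Random(seed + idx)
        if part:
            picked.extend(rng.sample(part, min(per_partition, len(part))))
    return sorted(picked)


if __name__ == "__main__":
    rows = [(i, i * 1.5) for i in range(20)]
    reordered = list(reversed(rows))  # same data, different arrival order

    for by_key in (False, True):
        first = seeded_sample(partition_rows(rows, 4, by_key), seed=1)
        second = seeded_sample(partition_rows(reordered, 4, by_key), seed=1)
        print(f"by_key={by_key}: identical picks = {first == second}")
```

With `by_key=True` the two runs always pick the same points, because each partition holds the same rows in the same order. With `by_key=False` the partitions hold different rows in each run, so the same seed can produce different picks.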
