Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mala
by New Contributor III
  • 2521 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to reproduce Kmeans Clustering results even after setting seed and tolerance

Hi, I have been trying to reproduce KMeans results with no luck. Here is my code snippet:
from pyspark.ml.clustering import KMeans
KMeans(featuresCol=featuresCol, k=clusters, maxIter=40, seed=1, tol=.00001)
Can anyone help?

Latest Reply
mala
New Contributor III
  • 2 kudos

This issue was due to Spark parallelization, which doesn't guarantee that the same data is assigned to each partition. I was able to resolve it by making sure the same data is always assigned to the same partitions: df.repartition(num_partitions, "ur_col_id")d...
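
To illustrate why partition assignment can matter even with a fixed seed, here is a toy model in plain Python. It is not Spark's implementation. It assumes a sampling step whose random generator depends on both the seed and the partition index, and all function names (`sample_per_partition`, `split_round_robin`, `split_by_key`) are hypothetical helpers made up for this sketch.

```python
import random
import zlib


def sample_per_partition(partitions, seed, k=2):
    """Toy sampling step: each partition draws k rows with its own RNG.

    Assumption: the RNG is derived from the seed AND the partition index,
    so the result depends on which rows sit in which partition.
    """
    picked = []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed * 1000 + index)
        ordered = sorted(rows)  # only partition membership matters here
        picked.extend(rng.sample(ordered, min(k, len(ordered))))
    return sorted(picked)


def split_round_robin(rows, n):
    """Assign rows to partitions by arrival order (depends on input order)."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def split_by_key(rows, n):
    """Assign rows to partitions by a stable hash of the row key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(str(row).encode()) % n].append(row)
    return parts


rows = list(range(20))
shuffled = list(reversed(rows))  # same data, different arrival order

# Key-based partitioning gives the same sample for both orderings.
stable_a = sample_per_partition(split_by_key(rows, 4), seed=1)
stable_b = sample_per_partition(split_by_key(shuffled, 4), seed=1)
print(stable_a == stable_b)
```

In this sketch, round-robin assignment puts different rows in each partition when the arrival order changes, while hashing on a key keeps membership fixed. That is the same idea as the `df.repartition(num_partitions, "ur_col_id")` call in the reply, applied before fitting the model.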

2 More Replies