What is the best practice for applying MLFlow to clustering algorithms?
06-08-2021 09:42 AM
What is the best practice for applying MLFlow to clustering algorithms? What are the kinds of metrics customers track?
- Labels: Best practice, MLflow, Model Lifecycle
06-18-2021 02:34 PM
Good question! I'll divide my suggestions into two parts:
(1) In terms of MLflow Tracking, clustering is quite similar to other ML workflows, so not much changes.
(2) In terms of the specific parameters, metrics, etc. to track, clustering is very different, so it helps to know the common and useful things to track.
For (1), the generic pieces of an ML workflow should be tracked in the same way as for classification, regression, and other problems:
- Params, especially whatever hyperparameters you changed from defaults
- Metrics (see below)
- Data source and version
- Code / notebook
- etc.
For (2), I'll list some recommendations I have for important params, metrics, etc., but I'll be interested to hear from others, especially if you have links to more detailed resources.
The "right" metrics to use can be very problem-dependent and model-dependent. At a high level, I'd make sure to log:
- The metric your algorithm is optimizing: For example, K-means optimizes for Euclidean distance. The scikit-learn documentation has a great list of metrics ("geometry") for models it supports: https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
- The metric you care most about: For example, if you know ground-truth assignments, you might use the Rand index; if you don't have ground truth, you might use the Silhouette coefficient. The scikit-learn documentation has lengthy explanations of some clustering metrics: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation and the Wikipedia page is good too: https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment
- (Both of the above, for both training and validation data)
Hope this helps!
10-07-2024 10:30 AM
Does it make sense to register a K-means clustering model once the experiment has been tracked and you are satisfied with the outcome? If so, how do you do it?