What is the best practice for applying MLflow to clustering algorithms?

User16826993440 — Tue, 08 Jun 2021 16:42:39 GMT

What is the best practice for applying MLflow to clustering algorithms? What kinds of metrics do customers track?

Re: What is the best practice for applying MLflow to clustering algorithms?

Joseph_B — Fri, 18 Jun 2021 21:34:39 GMT

Good question! I'll divide my suggestions into 2 parts:

(1) In terms of MLflow Tracking, clustering is pretty similar to other ML workflows, so not much changes.

(2) In terms of specific parameters, metrics, etc. to track, clustering is very different, so being aware of common and useful things to track is helpful.

For (1), the generic pieces of an ML workflow should be tracked in the same way as for classification, regression, and other problems:

Params, especially whatever hyperparameters you changed from defaults
Metrics (see below)
Data source and version
Code / notebook
etc.

For (2), I'll list some recommendations I have for important params, metrics, etc., but I'll be interested to hear from others, especially if you have links to more detailed resources.

The "right" metrics to use can be very problem-dependent and model-dependent. At a high level, I'd make sure to log:

The metric your algorithm is optimizing: For example, K-means minimizes the within-cluster sum of squared Euclidean distances (often called inertia). The scikit-learn documentation has a great list of metrics ("geometry") for models it supports: https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
The metric you care most about: For example, if you know ground-truth assignments, you might use the Rand index. If you don't have ground truth, you might use the Silhouette coefficient. The scikit-learn documentation has lengthy explanations of some clustering metrics ( https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation ), and the Wikipedia page is good too ( https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment ).
(Both of the above, for both training and validation data)
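
To make the metric suggestions above concrete, here is a sketch that computes them for a training and a validation split. It assumes scikit-learn's KMeans on synthetic data with known labels. The helper `clustering_metrics` and the metric names are hypothetical, and it uses the adjusted Rand index (a chance-corrected variant of the Rand index) for the ground-truth case.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.model_selection import train_test_split


def clustering_metrics(model, X, prefix, y_true=None):
    """Return a dict of clustering metrics for one data split (hypothetical helper)."""
    labels = model.predict(X)
    metrics = {
        # KMeans.score returns the negative of the K-means objective on X,
        # so negating it gives the sum of squared distances for this split.
        f"{prefix}_inertia": float(-model.score(X)),
        # No ground truth needed.
        f"{prefix}_silhouette": float(silhouette_score(X, labels)),
    }
    if y_true is not None:
        # Only meaningful when ground-truth assignments are known.
        metrics[f"{prefix}_adjusted_rand"] = float(adjusted_rand_score(y_true, labels))
    return metrics


# Synthetic stand-in data with known cluster labels (assumption).
X, y = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

all_metrics = {
    **clustering_metrics(model, X_train, "train", y_train),
    **clustering_metrics(model, X_val, "val", y_val),
}
print(all_metrics)
```

Inside an MLflow run, the resulting dict can be logged in one call with `mlflow.log_metrics(all_metrics)`.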

Hope this helps!

Re: What is the best practice for applying MLflow to clustering algorithms?

wallco26 — Mon, 07 Oct 2024 17:30:31 GMT

Does it make sense to register a K-means clustering model once the experiment has been tracked and you are satisfied with the outcome? If so, how do you do it?
