What is the best practice for applying MLFlow to clustering algorithms?
06-08-2021 09:42 AM
What is the best practice for applying MLFlow to clustering algorithms? What are the kinds of metrics customers track?
- Labels: Best practice, MLflow, Model Lifecycle
06-18-2021 02:34 PM
Good question! I'll divide my suggestions into two parts:
(1) In terms of MLflow Tracking, clustering is quite similar to other ML workflows, so not much changes.
(2) In terms of the specific parameters, metrics, etc. to track, clustering is very different, so it helps to know the common and useful things to track.
For (1), the generic pieces of an ML workflow should be tracked in the same way as for classification, regression, and other problems:
- Params, especially whatever hyperparameters you changed from defaults
- Metrics (see below)
- Data source and version
- Code / notebook
- etc.
For (2), I'll list some recommendations I have for important params, metrics, etc., but I'll be interested to hear from others, especially if you have links to more detailed resources.
The "right" metrics to use can be very problem-dependent and model-dependent. At a high level, I'd make sure to log:
- The metric your algorithm is optimizing: For example, K-means optimizes for Euclidean distance. The scikit-learn documentation has a great list of metrics ("geometry") for models it supports: https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
- The metric you care most about: For example, if you know ground-truth assignments, you might use the Rand index; if you don't have ground truth, you might use the Silhouette coefficient. The scikit-learn documentation has lengthy explanations of some clustering metrics: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation and the Wikipedia page is good too: https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment
- (Both of the above, for both training and validation data)
Hope this helps!
10-07-2024 10:30 AM
Does it make sense to register a K-means clustering model once the experiment has been tracked and you are satisfied with the outcome? If so, how do you do it?