Data Engineering

How should I tune hyperparameters when fitting models for every item?

Joseph_B
Databricks Employee

My dataset has an "item" column which groups the rows into many groups. (Think of these groups as items in a store.) I want to fit 1 ML model per group. Should I tune hyperparameters for each group separately? Or should I tune them for the entire process so that every group uses the same set of hyperparameters?

And, for either option, how do I set up that tuning?

1 REPLY

Joseph_B
Databricks Employee

For the first question ("which option is better?"), the answer depends on your understanding of the problem domain. Consider:

  • Do you expect similar behavior across the groups (items)?
    • If so, that's a +1 in favor of sharing hyperparameters. And vice versa.
  • Do you have similar numbers of training examples per group?
    • If so, that's also a +1 in favor of sharing hyperparameters. And vice versa.
  • How large is each group relative to the full dataset?
    • If each group is tiny vs. the full dataset, then you can expect tuning to be much less stable if run for each group separately. E.g., if you retrain your model each week, you might get very different hyperparameters (and predictions) each time.
    • I.e., if each group is fairly small, then consider using shared hyperparameters.

For the second question ("how do I do it?"), here's a sketch. This sketch is for groups that are small enough to fit on 1 machine. (When some or all groups are too large to fit on 1 machine, you can handle them separately using distributed training algorithms.)

  • For both options, use an Apache Spark DataFrame with groupBy to create a grouped DataFrame, then apply a grouped-map Pandas UDF (e.g., via applyInPandas) to each group. Within that UDF, call model training (or tuning, if applicable).
  • Shared hyperparameters (see the first sketch after this list):
    • Run the tuning library (e.g., Hyperopt) on the driver.
    • Each time the tuning algorithm tests 1 set of hyperparameters, it should fit models for all groups by applying the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, your ML library (e.g., sklearn) will use the global hyperparameter setting provided by Hyperopt to fit a model for that group.
  • Separate hyperparameters (see the second sketch after this list):
    • Apply the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, call tuning (e.g., Hyperopt), which will call your ML library in turn. After tuning, you will have a model for that group.
  • Note: If using Hyperopt here, use regular Trials, not SparkTrials. The Pandas UDF application already provides the distributed computing, so Hyperopt itself needs to run locally.
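
To make the shared-hyperparameters flow concrete, here is a minimal sketch. The toy schema (item, x1, x2, y), the model (sklearn's RandomForestRegressor), the single-parameter search space, and max_evals=20 are all illustrative assumptions, not part of the recipe itself. Hyperopt runs on the driver, and each trial fits one model per group via applyInPandas:

```python
import numpy as np
import pandas as pd
from hyperopt import Trials, fmin, hp, tpe
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real dataset: "item" defines the groups.
pdf = pd.DataFrame({
    "item": np.repeat(["a", "b", "c"], 50),
    "x1": np.random.rand(150),
    "x2": np.random.rand(150),
})
pdf["y"] = 2.0 * pdf["x1"] + np.random.rand(150)
df = spark.createDataFrame(pdf)

FEATURES = ["x1", "x2"]  # assumed feature columns

def objective(params):
    """One Hyperopt trial: fit one model per group with a shared setting."""
    max_depth = int(params["max_depth"])

    def fit_group(group_pdf: pd.DataFrame) -> pd.DataFrame:
        # Fit and score one group with the shared hyperparameter setting.
        model = RandomForestRegressor(max_depth=max_depth, n_estimators=50)
        score = cross_val_score(model, group_pdf[FEATURES], group_pdf["y"], cv=3).mean()
        return pd.DataFrame({"item": [group_pdf["item"].iloc[0]], "score": [score]})

    scores = (df.groupBy("item")
                .applyInPandas(fit_group, schema="item string, score double")
                .toPandas())
    return -scores["score"].mean()  # one aggregate loss across all groups

# Regular Trials, not SparkTrials: the Pandas UDF already distributes the work.
best = fmin(fn=objective,
            space={"max_depth": hp.quniform("max_depth", 2, 10, 1)},
            algo=tpe.suggest, max_evals=20, trials=Trials())
```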
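
And here is a minimal sketch of the separate-hyperparameters flow, reusing the toy df, schema, and model assumptions from the sketch above. This time the entire Hyperopt search runs inside the Pandas UDF, once per group:

```python
import pandas as pd
from hyperopt import Trials, fmin, hp, tpe
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

FEATURES = ["x1", "x2"]  # same assumed schema as above
SEARCH_SPACE = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

def tune_group(group_pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs a full Hyperopt search locally, inside the task, for one group."""
    X, y = group_pdf[FEATURES], group_pdf["y"]

    def objective(params):
        model = RandomForestRegressor(max_depth=int(params["max_depth"]),
                                      n_estimators=50)
        return -cross_val_score(model, X, y, cv=3).mean()

    # Regular Trials again: this code already runs inside a distributed task.
    best = fmin(fn=objective, space=SEARCH_SPACE, algo=tpe.suggest,
                max_evals=20, trials=Trials())
    return pd.DataFrame({"item": [group_pdf["item"].iloc[0]],
                         "best_max_depth": [int(best["max_depth"])]})

# One row per group, each with its own tuned hyperparameters.
best_per_item = (df.groupBy("item")
                   .applyInPandas(tune_group, schema="item string, best_max_depth long")
                   .toPandas())
```

Both sketches assume hyperopt and scikit-learn are available on the workers; in the Databricks Runtime for ML they come preinstalled.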
