For the first question ("which option is better?"), you will need to answer it based on your understanding of the problem domain.
- Do you expect similar behavior across the groups (items)?
  - If so, that's a +1 in favor of sharing hyperparameters. And vice versa.
- Do you have similar numbers of training examples per group?
  - If so, that's also a +1 in favor of sharing hyperparameters. And vice versa.
- How large are groups vs. the full dataset?
  - If each group is tiny vs. the full dataset, you can expect tuning to be much less stable when run for each group separately. E.g., if you retrain your model each week, you might get very different hyperparameters (and predictions) each time.
  - In other words, if each group is fairly small, consider using shared hyperparameters.
For the second question ("how do I do it?"), here's a sketch. It assumes each group is small enough to fit on one machine. (When some or all groups are too large for a single machine, you can handle those groups separately using distributed training algorithms.)
- For both approaches, use an Apache Spark DataFrame with groupBy to create a grouped DataFrame, then apply a Pandas UDF to each group. Within that UDF, call model training (or tuning, as applicable).
- Shared hyperparameters:
  - Run tuning (e.g., Hyperopt) on the driver.
  - Each time the tuning algorithm tests one set of hyperparameters, it fits models for all groups by applying the Pandas UDF to the DataFrame.
  - Within the Pandas UDF, your ML library (e.g., scikit-learn) uses the shared hyperparameter setting provided by Hyperopt to fit a model for that group. (See the first sketch after this list.)
- Separate hyperparameters:
  - Apply the Pandas UDF to the DataFrame.
  - Within the Pandas UDF, call tuning (e.g., Hyperopt), which in turn calls your ML library. After tuning completes, you have a model for that group. (See the second sketch after this list.)
- Note: If using Hyperopt, it should use regular Trials, not SparkTrials. The Pandas UDF application already distributes the work across the cluster, and Spark does not support launching distributed jobs from within Spark tasks, so Hyperopt itself must run locally.
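Here's a minimal sketch of the shared-hyperparameters workflow. It assumes a Spark DataFrame `df` with a `group_id` column, feature columns `x1` and `x2`, and a label `y`; the Ridge model, the search space, and the in-sample loss are illustrative placeholders, not recommendations:

```python
import pandas as pd
from hyperopt import fmin, hp, tpe, Trials
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

FEATURES = ["x1", "x2"]

def objective(params):
    """One trial: fit one model per group with the SAME hyperparameters,
    then aggregate the per-group losses into a single score."""
    alpha = params["alpha"]

    def fit_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
        model = Ridge(alpha=alpha)
        model.fit(pdf[FEATURES], pdf["y"])
        # In-sample loss for brevity; use a held-out split in practice.
        loss = mean_squared_error(pdf["y"], model.predict(pdf[FEATURES]))
        return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                             "loss": [loss]})

    losses = (
        df.groupBy("group_id")
          .applyInPandas(fit_one_group, schema="group_id string, loss double")
          .toPandas()
    )
    return float(losses["loss"].mean())

# Hyperopt runs on the driver with regular Trials; each trial launches
# a distributed Spark job through the Pandas UDF.
best = fmin(
    fn=objective,
    space={"alpha": hp.loguniform("alpha", -5, 2)},
    algo=tpe.suggest,
    max_evals=20,
    trials=Trials(),
)
```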
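And a sketch of the separate-hyperparameters workflow under the same assumptions. Here Hyperopt runs inside the Pandas UDF, once per group, which is why it must use regular Trials:

```python
import pandas as pd
from hyperopt import fmin, hp, tpe, Trials
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

FEATURES = ["x1", "x2"]

def tune_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs a full Hyperopt search on one group's data, entirely on
    one executor, and returns the best hyperparameters for that group."""

    def objective(params):
        model = Ridge(alpha=params["alpha"])
        # Cross-validated loss on this group's data only.
        scores = cross_val_score(model, pdf[FEATURES], pdf["y"],
                                 scoring="neg_mean_squared_error", cv=3)
        return -scores.mean()

    best = fmin(
        fn=objective,
        space={"alpha": hp.loguniform("alpha", -5, 2)},
        algo=tpe.suggest,
        max_evals=20,
        trials=Trials(),  # regular Trials: we are already inside a Spark task
        show_progressbar=False,
    )
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "best_alpha": [best["alpha"]]})

per_group_best = df.groupBy("group_id").applyInPandas(
    tune_one_group, schema="group_id string, best_alpha double"
)
```

In a real pipeline, you would typically refit a final model per group using the best hyperparameters, or serialize and return the fitted model from the UDF, rather than returning only the parameters as above.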