Data Engineering

How should I tune hyperparameters when fitting models for every item?

Joseph_B
Databricks Employee

My dataset has an "item" column which groups the rows into many groups. (Think of these groups as items in a store.) I want to fit 1 ML model per group. Should I tune hyperparameters for each group separately? Or should I tune them for the entire process so that every group uses the same set of hyperparameters?

And, for either option, how do I set up that tuning?

1 REPLY

Joseph_B
Databricks Employee

For the first question ("which option is better?"), the answer depends on your understanding of the problem domain. Consider:

  • Do you expect similar behavior across the groups (items)?
    • If so, that's a +1 in favor of sharing hyperparameters. And vice versa.
  • Do you have similar numbers of training examples per group?
    • If so, that's also a +1 in favor of sharing hyperparameters. And vice versa.
  • How large is each group relative to the full dataset?
    • If each group is tiny vs. the full dataset, then you can expect tuning to be much less stable if run for each group separately. E.g., if you retrain your model each week, you might get very different hyperparameters (and predictions) each time.
    • I.e., if each group is fairly small, then consider using shared hyperparameters.

For the second question ("how do I do it?"), here's a sketch. This sketch is for groups that are small enough to fit on 1 machine. (When some or all groups are too large to fit on 1 machine, you can handle them separately using distributed training algorithms.)

  • For both options, use an Apache Spark DataFrame with groupBy to create a grouped DataFrame, then apply a grouped-map Pandas UDF (e.g., via applyInPandas) to each group. Within that UDF, call model training (or tuning, if applicable).
  • Shared hyperparameters (see the first sketch after this list):
    • Run the tuning library (e.g., Hyperopt) on the driver.
    • Each time the tuning algorithm tests 1 set of hyperparameters, it should fit models for all groups by applying the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, your ML library (e.g., sklearn) will use the global hyperparameter setting provided by Hyperopt to fit a model for that group.
  • Separate hyperparameters (see the second sketch after this list):
    • Apply the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, call tuning (e.g., Hyperopt), which will call your ML library in turn. After tuning, you will have a model for that group.
  • Note: If using Hyperopt here, use regular Trials, not SparkTrials. The Pandas UDF application already provides the distributed computing, so Hyperopt itself needs to run locally.
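
To make the shared-hyperparameters flow concrete, here is a minimal sketch. The toy schema (item, x1, x2, y), the model (sklearn's RandomForestRegressor), the single-parameter search space, and max_evals=20 are all illustrative assumptions, not part of the recipe itself. Hyperopt runs on the driver, and each trial fits one model per group via applyInPandas:

```python
import numpy as np
import pandas as pd
from hyperopt import Trials, fmin, hp, tpe
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real dataset: "item" defines the groups.
pdf = pd.DataFrame({
    "item": np.repeat(["a", "b", "c"], 50),
    "x1": np.random.rand(150),
    "x2": np.random.rand(150),
})
pdf["y"] = 2.0 * pdf["x1"] + np.random.rand(150)
df = spark.createDataFrame(pdf)

FEATURES = ["x1", "x2"]  # assumed feature columns

def objective(params):
    """One Hyperopt trial: fit one model per group with a shared setting."""
    max_depth = int(params["max_depth"])

    def fit_group(group_pdf: pd.DataFrame) -> pd.DataFrame:
        # Fit and score one group with the shared hyperparameter setting.
        model = RandomForestRegressor(max_depth=max_depth, n_estimators=50)
        score = cross_val_score(model, group_pdf[FEATURES], group_pdf["y"], cv=3).mean()
        return pd.DataFrame({"item": [group_pdf["item"].iloc[0]], "score": [score]})

    scores = (df.groupBy("item")
                .applyInPandas(fit_group, schema="item string, score double")
                .toPandas())
    return -scores["score"].mean()  # one aggregate loss across all groups

# Regular Trials, not SparkTrials: the Pandas UDF already distributes the work.
best = fmin(fn=objective,
            space={"max_depth": hp.quniform("max_depth", 2, 10, 1)},
            algo=tpe.suggest, max_evals=20, trials=Trials())
```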
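
And here is a minimal sketch of the separate-hyperparameters flow, reusing the toy df, schema, and model assumptions from the sketch above. This time the entire Hyperopt search runs inside the Pandas UDF, once per group:

```python
import pandas as pd
from hyperopt import Trials, fmin, hp, tpe
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

FEATURES = ["x1", "x2"]  # same assumed schema as above
SEARCH_SPACE = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

def tune_group(group_pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs a full Hyperopt search locally, inside the task, for one group."""
    X, y = group_pdf[FEATURES], group_pdf["y"]

    def objective(params):
        model = RandomForestRegressor(max_depth=int(params["max_depth"]),
                                      n_estimators=50)
        return -cross_val_score(model, X, y, cv=3).mean()

    # Regular Trials again: this code already runs inside a distributed task.
    best = fmin(fn=objective, space=SEARCH_SPACE, algo=tpe.suggest,
                max_evals=20, trials=Trials())
    return pd.DataFrame({"item": [group_pdf["item"].iloc[0]],
                         "best_max_depth": [int(best["max_depth"])]})

# One row per group, each with its own tuned hyperparameters.
best_per_item = (df.groupBy("item")
                   .applyInPandas(tune_group, schema="item string, best_max_depth long")
                   .toPandas())
```

Both sketches assume hyperopt and scikit-learn are available on the workers; in the Databricks Runtime for ML they come preinstalled.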
