Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How should I tune hyperparameters when fitting models for every item?

Joseph_B
New Contributor III

My dataset has an "item" column which groups the rows into many groups. (Think of these groups as items in a store.) I want to fit 1 ML model per group. Should I tune hyperparameters for each group separately? Or should I tune them for the entire process so that every group uses the same set of hyperparameters?

And, for either option, how do I set up that tuning?

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Joseph_B! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise, I will get back to you soon. Thanks.

Joseph_B
New Contributor III

For the first question ("which option is better?"), you will need to answer that based on your understanding of the problem domain.

  • Do you expect similar behavior across the groups (items)?
    • If so, that's a +1 in favor of sharing hyperparameters. And vice versa.
  • Do you have similar numbers of training examples per group?
    • If so, that's also a +1 in favor of sharing hyperparameters. And vice versa.
  • How large are groups vs. the full dataset?
    • If each group is tiny relative to the full dataset, then you can expect tuning to be much less stable when run for each group separately. E.g., if you retrain your model each week, you might get very different hyperparameters (and predictions) each time.
    • I.e., if each group is fairly small, then consider using shared hyperparameters.

For the second question ("how do I do it?"), here's a sketch. This sketch is for groups that are small enough to fit on one machine. (When some or all groups are too large to fit on one machine, you can handle them separately using distributed training algorithms.)

  • For both options, use an Apache Spark DataFrame with groupBy to create a grouped DataFrame, then apply a Pandas UDF to each group. Within that UDF, call model training (or tuning, if applicable).
  • Shared hyperparameters:
    • Call tuning, e.g., Hyperopt to run on the driver.
    • Each time the tuning algorithm tests 1 set of hyperparameters, it should fit models for all groups by applying the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, your ML library (e.g., sklearn) will use the global hyperparameter setting provided by Hyperopt to fit a model for that group.
  • Separate hyperparameters:
    • Apply the Pandas UDF to the DataFrame.
    • Within the Pandas UDF, call tuning (e.g., Hyperopt). Tuning will call your ML library in turn. After tuning, you will have your model for that group.
  • Note: If using Hyperopt here, it should use regular Trials, not SparkTrials. The Pandas UDF application already uses distributed computing, so Hyperopt itself needs to run locally within each task.
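To make the shared-vs-separate distinction concrete, here is a minimal, library-free toy sketch of the two tuning strategies. It deliberately replaces Spark, Hyperopt, and your ML library with plain Python: the "model" predicts the mean of the training labels shrunk by a single hyperparameter alpha, and a small grid search stands in for Hyperopt. All names (fit_predict, tune_separate, tune_shared) and the data are illustrative, not part of any real API.

```python
from statistics import mean

def fit_predict(train_ys, test_ys, alpha):
    """Toy model: predict mean(train_ys) shrunk by alpha; return test MSE."""
    pred = mean(train_ys) / (1.0 + alpha)
    return mean((y - pred) ** 2 for y in test_ys)

def tune_separate(groups, grid):
    """Option B: pick the best alpha independently for each group."""
    best = {}
    for name, (train, test) in groups.items():
        best[name] = min(grid, key=lambda a: fit_predict(train, test, a))
    return best

def tune_shared(groups, grid):
    """Option A: pick one alpha minimizing the average loss over all groups."""
    def total_loss(alpha):
        return mean(fit_predict(tr, te, alpha) for tr, te in groups.values())
    return min(grid, key=total_loss)

# Two toy "items", each with a (train, test) split of label values.
groups = {
    "item_a": ([1.0, 2.0, 3.0], [2.0, 2.5]),
    "item_b": ([12.0, 14.0], [11.0]),
}
grid = [0.0, 0.1, 1.0]

per_group = tune_separate(groups, grid)  # one alpha per item
shared = tune_shared(groups, grid)       # one alpha for all items
```

In the real setup described above, the body of tune_separate would live inside the grouped Pandas UDF (with Hyperopt's regular Trials replacing the grid loop), while tune_shared's loop over alpha corresponds to Hyperopt running on the driver and applying the Pandas UDF once per candidate hyperparameter setting.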
