When running an AutoML experiment on Databricks, the default setup treats every data sample as equally important. This can be problematic for highly imbalanced datasets. To address this, and to let users incorporate domain knowledge into model training, the Databricks Runtime 15.4 ML release allows customizing the weight of each data sample or class when training regression and classification models. The following example demonstrates how to adjust sample weights when training a classification model on a highly imbalanced dataset.
Example: Predict client subscription at a bank
In this demo, a data scientist at a bank wants to perform a classification task: predicting whether a client will subscribe to a term deposit. The dataset is imbalanced: as the table below shows, 88.3% of clients do not subscribe and only 11.7% do (the label column indicates whether a client subscribed: 1 for not subscribed, 2 for subscribed):
To address the class imbalance, the data scientist assigns a weight of 1 to the not-subscribed class and a weight of 8 to the subscribed class, based on the roughly 88.3:11.7 ratio of the two classes. The data scientist could also choose different weights based on domain knowledge. Rows with higher weights have more influence on the model during training.
from pyspark.sql.functions import when
# Weight 1 for the majority class (label 1, not subscribed),
# weight 8 for the minority class (label 2, subscribed)
df = df.withColumn("sample_weight", when(df["label"] == 1, 1).otherwise(8))
display(df)
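The 1:8 split above is hand-picked to approximate the class ratio. As a minimal sketch (plain Python, with a hypothetical helper name `class_weights` not part of any Databricks API), inverse-frequency weights can be derived directly from the class counts, normalized so the majority class gets weight 1:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, normalized so the majority
    class has weight 1. With an 88.3:11.7 split this yields roughly
    1:7.5, close to the 1:8 split chosen above (illustrative only)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items()}

# Hypothetical label distribution mirroring the example's imbalance
labels = [1] * 883 + [2] * 117
weights = class_weights(labels)  # {1: 1.0, 2: ~7.55}
```

Rounding the derived minority weight to a convenient integer, as the example does, is a common practice; the exact value matters less than the order of magnitude.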
After adding the sample_weight column to the DataFrame, we can pass it directly to the classify API in the AutoML Python SDK.
from databricks import automl
automl.classify(df, target_col="label", sample_weight_col="sample_weight")
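To illustrate why up-weighted rows pull the model harder during training, here is a sketch of a weighted binary log-loss in plain Python. This is only illustrative: AutoML's underlying estimators apply sample weights inside their own fit routines, and the function name `weighted_log_loss` is ours, not part of any library.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted binary log-loss: each sample's loss term is scaled by
    its weight, so up-weighted minority samples contribute more to the
    objective the trainer minimizes."""
    total = sum(
        -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p, w in zip(y_true, p_pred, weights)
    )
    return total / sum(weights)

# A mispredicted minority sample (y=1, p=0.6) costs more under an
# 8:1 weighting than under uniform weights, nudging the optimizer
# to fit the minority class better.
uniform = weighted_log_loss([1, 0], [0.6, 0.1], [1, 1])
upweighted = weighted_log_loss([1, 0], [0.6, 0.1], [8, 1])
```

Here `upweighted > uniform`, which is exactly the pressure that makes a weighted trainer pay more attention to the minority class.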