Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Lanz
Databricks Employee

When running an AutoML experiment on Databricks, the default setup treats each data sample as equally important. However, this approach can be problematic when dealing with highly imbalanced datasets. To address this issue and accommodate users who want to incorporate their domain knowledge into the model training process, the ML Runtime 15.4 release allows customization of the weight of each data sample or class when training regression and classification models. The following example demonstrates how to adjust sample weight when training a classification model on a highly imbalanced dataset.

Example: Predict client subscription at a bank

In this demo, a data scientist at a bank wants to train a classifier that predicts whether a client will subscribe to a term deposit. The dataset is imbalanced: 88.3% of clients did not subscribe and only 11.7% subscribed, as shown in the table below (the label column is 1 for clients who did not subscribe and 2 for clients who subscribed):

[Image: table showing the class distribution of the label column]

To address the class imbalance, the data scientist assigns a weight of 8 to the subscribed class (label 2) and a weight of 1 to the not-subscribed class (label 1), based on the ratio of non-subscribers to subscribers (88.3 / 11.7 ≈ 7.5, rounded up to 8). The data scientist can also choose different weights based on their domain knowledge. Rows with higher weights have more influence on the model during training.
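The ratio-based choice of weights can be sketched in plain Python. This is only an illustration of the arithmetic, not part of the AutoML API: the helper name `inverse_frequency_weights` is hypothetical, and the class shares are the two percentages quoted in this post.

```python
def inverse_frequency_weights(class_shares):
    """Weight each class by (largest class share / its own share)."""
    largest_share = max(class_shares.values())
    return {label: largest_share / share for label, share in class_shares.items()}

# Class shares quoted in this post: label 1 = not subscribed, label 2 = subscribed.
class_shares = {1: 0.883, 2: 0.117}

raw_weights = inverse_frequency_weights(class_shares)
# Round to whole numbers so the weights are easy to read and reason about.
rounded_weights = {label: round(weight) for label, weight in raw_weights.items()}

print(raw_weights)
print(rounded_weights)
```

With these shares the rounded weights come out as 1 for label 1 and 8 for label 2, which matches the values used in this example. Rounding is a convenience here; unrounded ratios or weights chosen from domain knowledge are equally valid inputs.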

from pyspark.sql.functions import when

# Label 1 (did not subscribe) gets weight 1; label 2 (subscribed) gets weight 8.
df = df.withColumn("sample_weight", when(df["label"] == 1, 1).otherwise(8))
display(df)

[Image: output of display(df) with the new sample_weight column]

After adding the sample_weight column to the DataFrame, we can pass its name directly to the classify API in the AutoML Python SDK through the sample_weight_col parameter.

from databricks import automl

automl.classify(df, target_col="label", sample_weight_col="sample_weight")
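To build intuition for why a weight column changes what the model learns, here is a minimal sketch of a weighted log loss in plain Python. It is a generic illustration of weighted training, not a description of how AutoML computes its loss internally; the function name `weighted_log_loss` and the toy data are assumptions made for this example, and the toy labels use 0 for "did not subscribe" and 1 for "subscribed" rather than the 1/2 encoding of the dataset.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted average of per-row binary log loss."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        row_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * row_loss
    return total / sum(weights)

# Toy data: 8 non-subscribers (0) and 1 subscriber (1).
y_true = [0] * 8 + [1]
# A lazy model that predicts a 10% subscription probability for everyone.
p_pred = [0.1] * 9

equal_weights = [1] * 9            # every row counts the same
class_weights = [1] * 8 + [8]      # the subscriber row counts 8 times

unweighted = weighted_log_loss(y_true, p_pred, equal_weights)
weighted = weighted_log_loss(y_true, p_pred, class_weights)

print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

The lazy model is penalized much more under the weighted loss, because its one mistake on the minority class now carries as much total weight as all eight majority rows combined. That is the sense in which higher-weighted rows have more influence during training.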