When running an AutoML experiment on Databricks, the default setup treats every data sample as equally important. This can be problematic for highly imbalanced datasets. To address this, and to let users incorporate domain knowledge into model training, the Databricks Runtime 15.4 ML release allows customizing the weight of each data sample or class when training regression and classification models. The following example demonstrates how to adjust sample weights when training a classification model on a highly imbalanced dataset.
Example: Predict client subscription at a bank
In this demo, a data scientist at a bank wants to perform a classification task: predicting whether a client will subscribe to a term deposit. The dataset is imbalanced: as the table below shows, 88.3% of clients do not subscribe and only 11.7% do (the label column indicates whether a client subscribed: 1 for not subscribed, 2 for subscribed):
To address the class imbalance, the data scientist assigns a weight of 1 to the not-subscribed class and a weight of 8 to the subscribed class, based on the roughly 88.3:11.7 ratio of the two classes. The data scientist could also choose different weights based on domain knowledge. Rows with higher weights have more influence on the model during training.
from pyspark.sql.functions import when
# Weight 1 for the majority class (label 1, not subscribed),
# weight 8 for the minority class (label 2, subscribed)
df = df.withColumn("sample_weight", when(df["label"] == 1, 1).otherwise(8))
display(df)
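The 1:8 split above is hand-picked to approximate the class ratio. As a minimal sketch (plain Python, with a hypothetical helper name `class_weights` not part of any Databricks API), inverse-frequency weights can be derived directly from the class counts, normalized so the majority class gets weight 1:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, normalized so the majority
    class has weight 1. With an 88.3:11.7 split this yields roughly
    1:7.5, close to the 1:8 split chosen above (illustrative only)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items()}

# Hypothetical label distribution mirroring the example's imbalance
labels = [1] * 883 + [2] * 117
weights = class_weights(labels)  # {1: 1.0, 2: ~7.55}
```

Rounding the derived minority weight to a convenient integer, as the example does, is a common practice; the exact value matters less than the order of magnitude.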
After adding the sample_weight column to the DataFrame, we can pass it directly to the classify API in the AutoML Python SDK.
from databricks import automl
automl.classify(df, target_col="label", sample_weight_col="sample_weight")
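To illustrate why up-weighted rows pull the model harder during training, here is a sketch of a weighted binary log-loss in plain Python. This is only illustrative: AutoML's underlying estimators apply sample weights inside their own fit routines, and the function name `weighted_log_loss` is ours, not part of any library.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted binary log-loss: each sample's loss term is scaled by
    its weight, so up-weighted minority samples contribute more to the
    objective the trainer minimizes."""
    total = sum(
        -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p, w in zip(y_true, p_pred, weights)
    )
    return total / sum(weights)

# A mispredicted minority sample (y=1, p=0.6) costs more under an
# 8:1 weighting than under uniform weights, nudging the optimizer
# to fit the minority class better.
uniform = weighted_log_loss([1, 0], [0.6, 0.1], [1, 1])
upweighted = weighted_log_loss([1, 0], [0.6, 0.1], [8, 1])
```

Here `upweighted > uniform`, which is exactly the pressure that makes a weighted trainer pay more attention to the minority class.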