Machine Learning

AutoML with Stratified Sampling

jaredaw
New Contributor II

Is it possible to use a stratified sampling strategy for the train/test/validate splits that the automl library does? We are working in a context where we need to segregate certain groups from the training and test sets to see how our models generalize at the level of these groups.

One hack that I considered was trying to twist our classification model into a forecasting model by ordering our groups and then using the time_col parameter, but this is definitely the Wrong Way. Does anyone have any insight?

Thank you!

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Jared Webb:

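Yes, it is possible to use a stratified sampling strategy for the train/test/validate splits in the AutoML library. The approach here uses the AutoMLConfig class from the azureml.train.automl package and passes a stratification_column_names parameter naming the column(s) to stratify on. Note that azureml.train.automl is the Azure Machine Learning SDK, which is separate from the Databricks AutoML library (databricks.automl), so check the AutoMLConfig reference for your SDK version to confirm that stratification_column_names is supported before relying on it. Here is an example code snippet that shows how to use stratified sampling with AutoML: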

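from azureml.core import Dataset, Experiment, Workspace
from azureml.train.automl import AutoMLConfig

# Connect to the workspace described by a local config.json
workspace = Workspace.from_config()

# Placeholder names (not from the original post): replace with your own
dataset_name = 'my-dataset'
compute_target = workspace.compute_targets['cpu-cluster']
experiment = Experiment(workspace, 'automl-stratified')

# Load your dataset from a registered dataset in AzureML
dataset = Dataset.get_by_name(workspace, dataset_name)

# Specify the column(s) to stratify on
stratification_columns = ['group']

# Configure AutoML to use stratified sampling
automl_config = AutoMLConfig(
    task='classification',
    primary_metric='accuracy',
    training_data=dataset,
    label_column_name='label',
    featurization='auto',
    n_cross_validations=5,
    # Parameter as given in this answer; verify it against the
    # AutoMLConfig documentation for your SDK version
    stratification_column_names=stratification_columns,
    compute_target=compute_target,
    max_concurrent_iterations=4,
    experiment_timeout_minutes=30,
    enable_early_stopping=True
)

# Run AutoML
automl_run = experiment.submit(automl_config)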

In this example, stratification_columns is a list of column names to use for stratification. You can include as many columns as needed to get the desired level of stratification.
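To make concrete what stratifying on one or more columns means, here is a minimal pure-Python sketch. It is not the AutoML implementation; the function name stratified_split, the toy rows, and the 20% test fraction are all assumptions made for illustration. Each distinct combination of values in the stratification columns contributes roughly the same fraction of its rows to the test set.

```python
import random
from collections import defaultdict


def stratified_split(rows, strata_cols, test_fraction=0.2, seed=0):
    """Split rows so every stratum sends about test_fraction of its rows to test.

    Hypothetical helper for illustration; not part of any AutoML library.
    """
    rng = random.Random(seed)

    # Bucket the rows by the tuple of values in the stratification columns.
    by_stratum = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in strata_cols)
        by_stratum[key].append(row)

    train, test = [], []
    for key in sorted(by_stratum):
        members = list(by_stratum[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: two groups of ten rows each.
rows = [{"group": g, "label": i % 2} for g in ("a", "b") for i in range(10)]
train, test = stratified_split(rows, ["group"], test_fraction=0.2)
```

With this toy data, both groups appear in the test set in the same proportion as in the full dataset, which is the property stratification is meant to preserve.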

Note that you need to ensure that the stratification_columns you use are present in your dataset and have appropriate values for stratification. Also, keep in mind that stratification may not always be possible or desirable depending on your use case and data distribution, so you should carefully consider whether it is appropriate for your specific situation.
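One case where stratification is not what you want is the one described in the question: holding out entire groups so that no group appears in more than one split. A rough sketch of that idea follows, in plain Python. The helper name assign_group_splits, the 60/20/20 fractions, and the toy group IDs are assumptions. The resulting column of "train" / "validate" / "test" values is the kind of input that the split_col argument of databricks.automl.classify is documented to accept on recent Databricks ML runtimes; please confirm that against the documentation for your runtime version, since that call is shown only as a comment here.

```python
import random


def assign_group_splits(groups, fractions=(0.6, 0.2, 0.2), seed=0):
    """Map each distinct group to 'train', 'validate', or 'test'.

    Whole groups are assigned to one split, so no group spans two splits.
    Hypothetical helper for illustration only.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)

    n_train = round(len(unique) * fractions[0])
    n_validate = round(len(unique) * fractions[1])

    mapping = {}
    for i, group in enumerate(unique):
        if i < n_train:
            mapping[group] = "train"
        elif i < n_train + n_validate:
            mapping[group] = "validate"
        else:
            mapping[group] = "test"
    return mapping


# Toy data: 100 rows spread evenly over 10 groups.
groups = [f"g{i % 10}" for i in range(100)]
mapping = assign_group_splits(groups)
split_column = [mapping[g] for g in groups]

# Assumption, not executed here: with split_column stored as a string column
# named "split" on a DataFrame df, a Databricks AutoML run might look like
#   from databricks import automl
#   summary = automl.classify(dataset=df, target_col="label", split_col="split")
```

Because the assignment is made per group rather than per row, the test metrics then reflect performance on groups the model never saw during training.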

Anonymous
Not applicable

Hi @Jared Webb,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
