Databricks Community

jaredaw · ‎05-09-2023

Is it possible to use a stratified sampling strategy for the train/test/validate splits that the automl library does? We are working in a context where we need to segregate certain groups between the training and test sets to see how our models generalize at the level of these groups.

One hack that I considered was trying to twist our classification model into a forecasting model by ordering our groups and then using the time_col parameter, but this is definitely the Wrong Way. Does anyone have any insight?

Thank you!

Anonymous · ‎05-13-2023

@Jared Webb :

Yes, it is possible to use a stratified sampling strategy for the train/test/validate splits in the AutoML library. The example below uses the AutoMLConfig class in the azureml.train.automl package (the Azure Machine Learning SDK, which is separate from Databricks AutoML) and passes it a stratification_column_names parameter that specifies the column(s) to stratify on. Check that your SDK version accepts this parameter before relying on it. Here is an example code snippet that shows how to use stratified sampling with AutoML:

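from azureml.core import Dataset, Experiment, Workspace
from azureml.train.automl import AutoMLConfig

# Connect to the workspace described by a local config.json
workspace = Workspace.from_config()

# Load your dataset from a registered dataset in AzureML
# ('my_dataset' is a placeholder name)
dataset = Dataset.get_by_name(workspace, name='my_dataset')

# Look up an existing compute target ('my-cluster' is a placeholder name)
compute_target = workspace.compute_targets['my-cluster']

# Specify the column(s) to stratify on
stratification_columns = ['group']

# Configure AutoML to use stratified sampling
# (stratification_column_names is as given in this answer; confirm it
# against the AutoMLConfig reference for your SDK version)
automl_config = AutoMLConfig(
    task='classification',
    primary_metric='accuracy',
    training_data=dataset,
    label_column_name='label',
    featurization='auto',
    n_cross_validations=5,
    stratification_column_names=stratification_columns,
    compute_target=compute_target,
    max_concurrent_iterations=4,
    experiment_timeout_minutes=30,
    enable_early_stopping=True
)

# Run AutoML ('automl-stratified' is a placeholder experiment name)
experiment = Experiment(workspace, 'automl-stratified')
automl_run = experiment.submit(automl_config)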

In this example, stratification_columns is a list of column names to use for stratification. You can include as many columns as needed to get the desired level of stratification.
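To make the idea concrete independently of any AutoML library, here is a minimal pure-Python sketch of what stratifying on a column means: each distinct value of the stratification column(s) contributes roughly the same fraction of its rows to the test set. The function name stratified_split, the toy rows, and the 20% test fraction are illustrative assumptions, not part of any AutoML API.

```python
import random
from collections import defaultdict


def stratified_split(rows, strata_cols, test_fraction=0.2, seed=0):
    """Split rows (a list of dicts) so that every stratum sends about
    test_fraction of its rows to the test set."""
    rng = random.Random(seed)

    # Bucket rows by the tuple of values in the stratification columns.
    by_stratum = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in strata_cols)
        by_stratum[key].append(row)

    train, test = [], []
    for key in sorted(by_stratum):
        members = list(by_stratum[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: three groups with ten rows each (hypothetical).
rows = [{"group": g, "label": i % 2} for g in ("a", "b", "c") for i in range(10)]
train, test = stratified_split(rows, ["group"], test_fraction=0.2)
```

With a split like this, every group shows up in both the training and the test set, in the same proportions as in the full data.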

Note that you need to ensure that the stratification_columns you use are present in your dataset and have appropriate values for stratification. Also, keep in mind that stratification may not always be possible or desirable depending on your use case and data distribution, so you should carefully consider whether it is appropriate for your specific situation.
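If the goal is the one described in the question, keeping whole groups out of training to measure how the model generalizes to unseen groups, stratification is not the right tool, because it deliberately places every group in every split. A group-wise hold-out can be sketched as follows. The helper assign_group_splits, the "split" column name, and the 60/20/20 fractions are hypothetical choices for illustration only.

```python
import random


def assign_group_splits(rows, group_col, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return copies of rows with an added 'split' key. All rows that share
    a value in group_col receive the same split, so no group is shared
    between train, validate, and test."""
    rng = random.Random(seed)

    # Shuffle the distinct group ids, then carve them into three blocks.
    groups = sorted({row[group_col] for row in rows})
    rng.shuffle(groups)
    n_train = round(len(groups) * fractions[0])
    n_validate = round(len(groups) * fractions[1])

    split_of = {}
    for i, group in enumerate(groups):
        if i < n_train:
            split_of[group] = "train"
        elif i < n_train + n_validate:
            split_of[group] = "validate"
        else:
            split_of[group] = "test"

    return [dict(row, split=split_of[row[group_col]]) for row in rows]


# Toy data: ten groups with five rows each (hypothetical).
rows = [{"group": f"g{k}", "label": i % 2} for k in range(10) for i in range(5)]
labeled = assign_group_splits(rows, "group")
```

A precomputed column like this can be handed to tools that accept user-defined splits. The Databricks AutoML documentation describes a split_col argument for this purpose, with the values train, validate, and test; availability depends on the runtime version, so check the documentation for your workspace. With very few groups the rounding can leave a split empty, so inspect the group counts per split before training.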

Anonymous · ‎05-20-2023

Hi @Jared Webb,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
