Lanz, Databricks Employee

When launching an AutoML experiment on Databricks, the default run splits the dataset randomly: 60% for training, 20% for validation, and 20% for testing. Starting with Databricks Runtime 15.3 ML, users can customize how AutoML splits the dataset.
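
For reference, a default run needs nothing beyond the dataset and the target column; a minimal sketch, using the same placeholder names (df_regress, target) as the examples below:

from databricks import automl
# With no split_col, AutoML applies the default random 60/20/20 split
summary = automl.regress(df_regress, target_col="target")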

Use Case #1: Explicit Sample Categorization 

If you want to specify the category of each sample explicitly, add a column whose values are “train”, “validate”, or “test”. When calling the AutoML API, pass the column's name to the split_col argument, as shown below. The AutoML experiment will then split the dataset based on this column.

from databricks import automl
# AutoML assigns each row to the train, validate, or test set
# according to its value in the "custom_split" column
summary_regress = automl.regress(df_regress, target_col="target", split_col="custom_split")
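
One common reason to assign categories explicitly is a chronological split, where the model is validated and tested on data newer than anything it trained on; a minimal sketch, assuming a hypothetical timestamp column event_date and placeholder cutoff dates:

from pyspark.sql.functions import when, col

# event_date and the cutoff dates below are placeholders; adapt them to your data
df_regress = df_regress.withColumn(
    "custom_split",
    when(col("event_date") < "2024-01-01", "train")
    .when(col("event_date") < "2024-04-01", "validate")
    .otherwise("test"),
)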

Use Case #2: Custom Split Ratios

If you prefer AutoML to split the dataset with a ratio other than the default 60:20:20, you can populate a new column according to your target ratios and pass it through the same split_col argument as in Use Case #1. For example, to split the dataset with an 80:10:10 ratio, you can do the following:

from pyspark.sql.functions import when, rand

seed = 42  # set a seed for reproducibility
train_ratio, validate_ratio, test_ratio = 0.8, 0.1, 0.1  # define your preferred ratios here

# Assign each row a uniform random number, then bucket by cumulative ratio:
# [0, 0.8) -> train, [0.8, 0.9) -> validate, [0.9, 1.0) -> test
df = df.withColumn("random", rand(seed=seed))
df = df.withColumn("custom_split", when(df.random < train_ratio, "train")
                                    .when(df.random < train_ratio + validate_ratio, "validate")
                                    .otherwise("test"))
df = df.drop("random")

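Before launching the experiment, it can be worth sanity-checking the bucket sizes with a quick DataFrame aggregation:

# The counts should be roughly proportional to 80:10:10
df.groupBy("custom_split").count().show()
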
You can also define different split ratios for different classes in a classification problem as follows:

from pyspark.sql.functions import lit

seed = 42  # set a seed for reproducibility
ratios = {  # define your preferred ratios here; keys must match the values in your label column
   "1": (0.7, 0.2, 0.1),  # For class 1, 70% train, 20% validate, 10% test
   "2": (0.8, 0.1, 0.1)   # For class 2, 80% train, 10% validate, 10% test
}

# Sample the training data, stratified by label
# (sampleBy selects each row independently, so the realized ratios are approximate)
train_df = df.sampleBy("label", fractions={x: y[0] for x, y in ratios.items()}, seed=seed)
train_df = train_df.withColumn("custom_split", lit("train"))

# Subtract the training data from the original DataFrame
remaining_df = df.join(train_df, df.columns, "left_anti")

# Sample the validation data from the remainder; each fraction is rescaled
# because the training rows have already been removed
validate_df = remaining_df.sampleBy("label", fractions={x: y[1] / (y[1] + y[2]) for x, y in ratios.items()}, seed=seed)
validate_df = validate_df.withColumn("custom_split", lit("validate"))

# Subtract the validation data from the remaining DataFrame
remaining_df = remaining_df.join(validate_df, df.columns, "left_anti")

# Whatever is left becomes the test set
test_df = remaining_df.withColumn("custom_split", lit("test"))

# Combine all subsets
df = train_df.union(validate_df).union(test_df)
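
Once the column is in place, you can check the realized per-class proportions and launch the experiment exactly as in Use Case #1; a minimal sketch, assuming the target is the same label column used above:

# Inspect the realized split within each class
df.groupBy("label", "custom_split").count().orderBy("label", "custom_split").show()

# Pass the custom split to AutoML as before
from databricks import automl
summary_classify = automl.classify(df, target_col="label", split_col="custom_split")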

Feel free to adjust any part of this to better suit your needs!