When launching an AutoML experiment on Databricks, the default run splits the dataset randomly into 60% for training, 20% for validation, and 20% for testing. Starting with Databricks Runtime 15.3 ML, users can customize the dataset split in AutoML.
Use Case #1: Explicit Sample Categorization
If you want to assign each sample to a split explicitly, insert a column whose values are “train”, “validate”, or “test”. When calling the AutoML API, pass the name of this column to the split_col argument as shown below; the AutoML experiment will then split the dataset based on this column.
from databricks import automl
# df_regress contains a "custom_split" column with values "train", "validate", or "test"
summary_regress = automl.regress(df_regress, target_col="target", split_col="custom_split")
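One common reason to categorize samples explicitly is a time-based split. As a minimal sketch, assuming a hypothetical event_date column and illustrative cutoff dates (neither is part of the original example), you could route older records to training and newer ones to validation and testing:
from pyspark.sql.functions import when, col
# "event_date" and both cutoff dates are illustrative assumptions; adjust to your data
df_regress = df_regress.withColumn("custom_split",
    when(col("event_date") < "2023-01-01", "train")
    .when(col("event_date") < "2023-07-01", "validate")
    .otherwise("test"))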
Use Case #2: Custom Split Ratios
If you prefer AutoML to split the dataset with a different ratio than the default 60:20:20, populate a new column with split labels drawn according to your target ratios and pass it via the same split_col argument as in Use Case #1. For example, to split the dataset with an 80:10:10 ratio, you can do the following:
from pyspark.sql.functions import when, rand
seed = 42  # define your seed here for reproducibility
train_ratio, validate_ratio, test_ratio = 0.8, 0.1, 0.1  # define your preferred ratios here
# Assign each row a uniform random number, then bucket it by cumulative ratio:
# [0, 0.8) -> train, [0.8, 0.9) -> validate, [0.9, 1.0) -> test
df = df.withColumn("random", rand(seed=seed))
df = df.withColumn("custom_split", when(df.random < train_ratio, "train")
                                   .when(df.random < 1 - test_ratio, "validate")
                                   .otherwise("test"))
df = df.drop("random")
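With the column in place, pass it to AutoML exactly as in Use Case #1. Because rand() only approximates the target ratios on a finite dataset, a quick count of the realized split makes a useful sanity check; the target_col name below is carried over from the earlier regression example:
from databricks import automl
df.groupBy("custom_split").count().show()  # verify the realized ~80:10:10 split
summary_regress = automl.regress(df, target_col="target", split_col="custom_split")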
You can also define different split ratios for different classes in a classification problem as follows:
from pyspark.sql.functions import lit
seed = 42  # define your seed here for reproducibility
ratios = {  # define your preferred (train, validate, test) ratios per class here;
            # keys must match the actual values in the "label" column
    "1": (0.7, 0.2, 0.1),  # For class 1, 70% train, 20% validate, 10% test
    "2": (0.8, 0.1, 0.1)   # For class 2, 80% train, 10% validate, 10% test
}
# Sample for training data (sampleBy draws each row independently, so realized fractions are approximate)
train_df = df.sampleBy("label", fractions={x: y[0] for x, y in ratios.items()}, seed=seed)
train_df = train_df.withColumn("custom_split", lit("train"))
# Subtract training data from original DataFrame
remaining_df = df.join(train_df, df.columns, "left_anti")
# Sample for validation data, rescaling the fraction since remaining_df holds only the validate + test portion of each class
validate_df = remaining_df.sampleBy("label", fractions={x: y[1] / (y[1] + y[2]) for x, y in ratios.items()}, seed=seed)
validate_df = validate_df.withColumn("custom_split", lit("validate"))
# Subtract validation data from remaining DataFrame
remaining_df = remaining_df.join(validate_df, df.columns, "left_anti")
# The rest is for testing
test_df = remaining_df.withColumn("custom_split", lit("test"))
# Combine all subsets
df = train_df.union(validate_df).union(test_df)
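Finally, pass the column to AutoML as before; for a classification problem the classification entry point applies. A minimal sketch, assuming the target column is the same "label" column used for the stratified sampling above:
from databricks import automl
summary_classify = automl.classify(df, target_col="label", split_col="custom_split")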
Feel free to adjust any part of this to better suit your needs!