Lanz, Databricks Employee

When launching an AutoML experiment on Databricks, the default run splits the dataset randomly: 60% for training, 20% for validation, and 20% for testing. Starting with Databricks Runtime 15.3 ML, users can customize how AutoML splits the dataset.
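
For reference, a default run needs nothing beyond the dataset and the target column; a minimal sketch, using the same placeholder names (df_regress, target) as the examples below:

from databricks import automl
# With no split_col, AutoML applies the default random 60/20/20 split
summary = automl.regress(df_regress, target_col="target")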

Use Case #1: Explicit Sample Categorization 

If you want to specify the category of each sample explicitly, add a column whose values are “train”, “validate”, or “test”. When calling the AutoML API, pass the column's name to the split_col argument, as shown below. The AutoML experiment will then split the dataset based on this column.

from databricks import automl
# AutoML assigns each row to the train, validate, or test set
# according to its value in the "custom_split" column
summary_regress = automl.regress(df_regress, target_col="target", split_col="custom_split")
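
One common reason to assign categories explicitly is a chronological split, where the model is validated and tested on data newer than anything it trained on; a minimal sketch, assuming a hypothetical timestamp column event_date and placeholder cutoff dates:

from pyspark.sql.functions import when, col

# event_date and the cutoff dates below are placeholders; adapt them to your data
df_regress = df_regress.withColumn(
    "custom_split",
    when(col("event_date") < "2024-01-01", "train")
    .when(col("event_date") < "2024-04-01", "validate")
    .otherwise("test"),
)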

Use Case #2: Custom Split Ratios

If you prefer AutoML to split the dataset with a ratio other than the default 60:20:20, you can populate a new column according to your target ratios and pass it through the same split_col argument as in Use Case #1. For example, to split the dataset with an 80:10:10 ratio, you can do the following:

from pyspark.sql.functions import when, rand

seed = 42  # set a seed for reproducibility
train_ratio, validate_ratio, test_ratio = 0.8, 0.1, 0.1  # define your preferred ratios here

# Assign each row a uniform random number, then bucket by cumulative ratio:
# [0, 0.8) -> train, [0.8, 0.9) -> validate, [0.9, 1.0) -> test
df = df.withColumn("random", rand(seed=seed))
df = df.withColumn("custom_split", when(df.random < train_ratio, "train")
                                    .when(df.random < train_ratio + validate_ratio, "validate")
                                    .otherwise("test"))
df = df.drop("random")

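Before launching the experiment, it can be worth sanity-checking the bucket sizes with a quick DataFrame aggregation:

# The counts should be roughly proportional to 80:10:10
df.groupBy("custom_split").count().show()
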
You can also define different split ratios for different classes in a classification problem as follows:

from pyspark.sql.functions import lit

seed = 42  # set a seed for reproducibility
ratios = {  # define your preferred ratios here; keys must match the values in your label column
   "1": (0.7, 0.2, 0.1),  # For class 1, 70% train, 20% validate, 10% test
   "2": (0.8, 0.1, 0.1)   # For class 2, 80% train, 10% validate, 10% test
}

# Sample the training data, stratified by label
# (sampleBy selects each row independently, so the realized ratios are approximate)
train_df = df.sampleBy("label", fractions={x: y[0] for x, y in ratios.items()}, seed=seed)
train_df = train_df.withColumn("custom_split", lit("train"))

# Subtract the training data from the original DataFrame
remaining_df = df.join(train_df, df.columns, "left_anti")

# Sample the validation data from the remainder; each fraction is rescaled
# because the training rows have already been removed
validate_df = remaining_df.sampleBy("label", fractions={x: y[1] / (y[1] + y[2]) for x, y in ratios.items()}, seed=seed)
validate_df = validate_df.withColumn("custom_split", lit("validate"))

# Subtract the validation data from the remaining DataFrame
remaining_df = remaining_df.join(validate_df, df.columns, "left_anti")

# Whatever is left becomes the test set
test_df = remaining_df.withColumn("custom_split", lit("test"))

# Combine all subsets
df = train_df.union(validate_df).union(test_df)
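
Once the column is in place, you can check the realized per-class proportions and launch the experiment exactly as in Use Case #1; a minimal sketch, assuming the target is the same label column used above:

# Inspect the realized split within each class
df.groupBy("label", "custom_split").count().orderBy("label", "custom_split").show()

# Pass the custom split to AutoML as before
from databricks import automl
summary_classify = automl.classify(df, target_col="label", split_col="custom_split")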

Feel free to adjust any part of this to better suit your needs!