Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Custom AutoML pipeline: Beyond StandardScaler().

sharpbetty
New Contributor II

The automated notebook pipeline in an AutoML experiment applies StandardScaler to all numerical features in the training dataset as part of the PreProcessor. See below.

[Screenshot: generated preprocessing code applying StandardScaler to every numeric column]

But I want a more nuanced and varied treatment of my numeric values (e.g., log transforms, winsorization, sigmoid transforms).

I want to either:
a) Remove all feature engineering / scaling from the automated Preprocessor and have the AutoML notebook run my features as presented to the AutoML experiment.
or
b) Edit the AutoML default pipeline to include a custom PreProcessor script to do more than simply scale every numeric value.

How can I achieve this? I can't find any option to customize this in the UI AutoML experiment setup, and I have no idea where to find the code for the default pipeline that's invoked on every experiment.

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @sharpbetty,

Great question! Databricks AutoML's "glass box" approach actually gives you several options to customize preprocessing beyond the default StandardScaler. Here are two practical approaches:

Option A: Pre-process Features Before AutoML

The simplest solution is to apply your custom transformations (log transforms, winsorization, sigmoid transforms, etc.) before passing the data to AutoML. Create a preprocessing notebook where you apply all your feature engineering, save the transformed dataset, and then point AutoML at this pre-processed table. Because AutoML applies StandardScaler through a ColumnTransformer, your dataset can mix already-transformed numeric features with raw ones, leaving AutoML to handle only the basic preprocessing.

AutoML will still apply its default preprocessing pipeline, but if your features are already appropriately scaled or transformed, the additional StandardScaler step will have minimal impact. Alternatively, you can include passthrough columns for features you don't want processed further.
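As a concrete sketch of Option A, the snippet below applies the kinds of transforms you mention with plain pandas/NumPy before AutoML is involved. The column names, quantile bounds, and table name are illustrative, and the AutoML call itself (commented out) assumes the standard `databricks.automl` API available on a Databricks cluster:

```python
import numpy as np
import pandas as pd

# Illustrative feature table; in practice, load your real training data.
df = pd.DataFrame({
    "income": [30_000.0, 55_000.0, 1_200_000.0, 48_000.0],
    "age": [25.0, 40.0, 38.0, 52.0],
    "label": [0, 1, 1, 0],
})

# Custom numeric treatments, applied before AutoML ever sees the data.
df["income"] = np.log1p(df["income"])                 # log transform
lo, hi = df["income"].quantile([0.05, 0.95])
df["income"] = df["income"].clip(lo, hi)              # winsorization
z = (df["age"] - df["age"].mean()) / df["age"].std()
df["age"] = 1.0 / (1.0 + np.exp(-z))                  # sigmoid squash

# On Databricks you would then persist the table and point AutoML at it:
# spark.createDataFrame(df).write.saveAsTable("my_schema.features_preprocessed")
# from databricks import automl
# summary = automl.classify(
#     dataset=spark.table("my_schema.features_preprocessed"),
#     target_col="label",
# )
```

Because these transforms happen upstream, AutoML's StandardScaler only re-centers values that are already on sensible scales.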

Option B: Modify the Generated Notebooks

Databricks AutoML generates editable Python notebooks for each trial, which is the core of its glass box approach. Here's how to leverage this:

1. Run an initial AutoML experiment to generate the baseline notebooks
2. Access the generated notebooks from the Source column in the AutoML experiment UI or from the MLflow run artifacts
3. Edit the preprocessing pipeline in the generated notebook: specifically, modify or replace the ColumnTransformer and StandardScaler components with your custom transformers
4. Re-run the modified notebook with your custom preprocessing logic

The generated notebooks contain the complete source code including data preprocessing, feature engineering, model training, and hyperparameter tuning, so you have full control to customize any part of the pipeline.
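To illustrate step 3, here is a minimal sketch of what replacing the generated StandardScaler-only ColumnTransformer might look like. The column names, transforms, and model below are placeholders, not the actual code AutoML generates for your experiment:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Custom per-column transformers instead of a blanket StandardScaler.
log_transform = FunctionTransformer(np.log1p)
sigmoid_transform = FunctionTransformer(lambda X: 1.0 / (1.0 + np.exp(-X)))

preprocessor = ColumnTransformer(
    transformers=[
        ("log", log_transform, ["income"]),
        ("sigmoid", sigmoid_transform, ["age"]),
        ("scale", StandardScaler(), ["tenure"]),  # keep scaling where it helps
    ],
    remainder="passthrough",  # leave any other columns untouched
)

# Drop-in replacement for the preprocessor step of the generated pipeline.
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression()),
])

# Toy data just to show the pipeline fits and predicts end to end.
X = pd.DataFrame({
    "income": [30_000.0, 55_000.0, 90_000.0, 48_000.0],
    "age": [-1.0, 0.5, 0.2, 1.3],
    "tenure": [1.0, 5.0, 3.0, 10.0],
})
y = [0, 1, 1, 0]
pipeline.fit(X, y)
```

Swapping the ColumnTransformer this way keeps the rest of the generated notebook (training, tuning, MLflow logging) intact.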

Finding the Pipeline Code

The AutoML pipeline code isn't exposed in the UI setup, but it's automatically generated and accessible through the trial notebooks. For the best trial, the notebook is automatically imported to your workspace. For other trials, notebooks are saved as MLflow artifacts under the Artifacts section of each run page and can be downloaded or imported.

Unfortunately, there's no built-in UI option to customize the preprocessing pipeline before the AutoML run starts; you'll need to use one of the approaches above. The glass box approach is designed specifically to enable this kind of post-generation customization.

 

Hope this helps, Louis.