Greetings @sharpbetty
Great question! Databricks AutoML's "glass box" approach actually gives you several options to customize preprocessing beyond the default StandardScaler. Here are two practical approaches:
Option A: Pre-process Features Before AutoML
The simplest solution is to apply your custom transformations (log transforms, winsorization, sigmoid transforms, etc.) before passing the data to AutoML. Create a preprocessing notebook that applies all your feature engineering, save the transformed dataset as a table, and point AutoML at that table. AutoML applies StandardScaler per column through a ColumnTransformer, so the table can mix already-transformed numerical features with raw ones and leave only the basic preprocessing to AutoML.
AutoML will still apply its default preprocessing pipeline to that table. Because StandardScaler only shifts and rescales each column linearly, it keeps the distribution shape your transforms produced, so the extra step has little practical effect. For features you don't want processed any further, a passthrough entry in the ColumnTransformer is the usual route, which in practice means editing the generated notebook (see Option B).
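As a rough illustration of Option A, here is a minimal pandas sketch of the kind of preprocessing notebook described above. The column names (income, balance, score), the quantile cut-offs, and the table name in the final comment are all hypothetical placeholders, not anything AutoML requires.

```python
import numpy as np
import pandas as pd


def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip a column to its lower/upper quantiles (cut-offs are an assumption)."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=lo, upper=hi)


def sigmoid(s: pd.Series) -> pd.Series:
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the custom transforms; column names here are hypothetical."""
    out = df.copy()
    out["income"] = np.log1p(out["income"])     # log transform for skewed, non-negative values
    out["balance"] = winsorize(out["balance"])  # tame extreme outliers
    out["score"] = sigmoid(out["score"])        # bounded transform
    return out


# Tiny made-up dataset, just to show the shape of the workflow.
raw = pd.DataFrame(
    {
        "income": [0.0, 10.0, 1000.0, 50.0, 75.0],
        "balance": [1.0, 2.0, 3.0, 4.0, 1000.0],
        "score": [-2.0, 0.0, 2.0, 1.0, -1.0],
        "label": [0, 1, 1, 0, 1],
    }
)
features = preprocess(raw)

# On Databricks you would then persist the result and select it in AutoML, e.g.:
# spark.createDataFrame(features).write.mode("overwrite").saveAsTable("my_schema.my_features")
```

The target column is left untouched, and the saved table is what you would select as the dataset when setting up the AutoML experiment.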
Option B: Modify the Generated Notebooks
Databricks AutoML generates editable Python notebooks for each trial, which is the core of its glass box approach. Here's how to leverage this:
1. Run an initial AutoML experiment to generate the baseline notebooks
2. Access the generated notebooks from the Source column in the AutoML experiment UI or from the MLflow run artifacts
3. Edit the preprocessing pipeline in the generated notebook - specifically, modify or replace the ColumnTransformer and StandardScaler components with your custom transformers
4. Re-run the modified notebook with your custom preprocessing logic
The generated notebooks contain the complete source code including data preprocessing, feature engineering, model training, and hyperparameter tuning, so you have full control to customize any part of the pipeline.
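For step 3, the edit usually comes down to swapping transformers inside a scikit-learn ColumnTransformer. The sketch below is not the code AutoML generates; the column names, step names, and the LogisticRegression stand-in are assumptions chosen only to show the pattern of replacing StandardScaler with a custom transformer for some columns and passing others through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical column groups; adapt them to the columns in your generated notebook.
log_cols = ["income"]       # replace the default scaling with a log transform
scaled_cols = ["age"]       # keep StandardScaler here
untouched_cols = ["score"]  # leave as-is

preprocessor = ColumnTransformer(
    transformers=[
        ("log", FunctionTransformer(np.log1p), log_cols),
        ("scale", StandardScaler(), scaled_cols),
        ("keep", "passthrough", untouched_cols),
    ],
    remainder="drop",
)

# Stand-in estimator; the generated notebook will have its own model step.
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

# Small made-up dataset to exercise the pipeline.
X = pd.DataFrame(
    {
        "income": [0.0, 10.0, 1000.0, 50.0, 75.0, 20.0],
        "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0],
        "score": [0.1, 0.5, 0.9, 0.4, 0.7, 0.2],
    }
)
y = np.array([0, 1, 1, 0, 1, 0])

model.fit(X, y)
X_transformed = preprocessor.fit_transform(X)
```

The output columns follow the order of the transformers list, so here the log-transformed column comes first, then the scaled one, then the untouched one.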
Finding the Pipeline Code
The AutoML pipeline code isn't exposed in the experiment setup UI, but it is generated automatically and accessible through the trial notebooks. For the best trial, the notebook is automatically imported to your workspace. For other trials, notebooks are saved as MLflow artifacts under the Artifacts section of each run page and can be downloaded or imported.
Unfortunately, there's no built-in UI option to customize the preprocessing pipeline before the AutoML run starts - you'll need to use one of the approaches above. The glass box approach is designed specifically to enable this kind of post-generation customization.
Hope this helps, Louis.