
AutoML master notebook failing

dkxxx-rc
Contributor

I have recently been able to run AutoML successfully on a certain dataset, but it has just failed on a second dataset of similar construction, before producing any machine learning training runs or output. The Experiments page says:

```
Model training failed
For more information, visit the AutoML job run.
An unknown error occurred
```

The phrase "AutoML job run" links to a Run of an auto-generated training notebook.  In that notebook, the failure occurs in a cell whose contents are :

[screenshot of the failing notebook cell]

The error statement is: `A column, variable, or function parameter with name _automl_sample_weight_0000 cannot be resolved.` That name, `_automl_sample_weight_0000`, is of course not from my data; it's something that AutoML is creating, or failing to create.

I am not using Feature Store or anything super-clever in the ML pipeline. My data simply comes from a Delta table, albeit a bigger one than the one on which AutoML worked successfully. Call it roughly 50,000 rows by 6,000 columns.
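
For concreteness, the run is launched with a plain `databricks.automl` call on the Delta table, roughly like this (a minimal sketch; the table and column names here are placeholders, not my real ones, and classification is shown — `automl.regress` would be the regression counterpart):

```python
from databricks import automl

# Placeholder catalog/schema/table and target-column names.
df = spark.table("main.default.wide_training_table")

summary = automl.classify(
    dataset=df,
    target_col="label",
    timeout_minutes=120,
)
```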

Any suggestions for fixing this?

1 ACCEPTED SOLUTION


stbjelcevic
Databricks Employee

Hi @dkxxx-rc ,

Thanks for the detailed context. This error is almost certainly coming from AutoML’s internal handling of imbalanced data and sampling, not your dataset itself.

The internal column `_automl_sample_weight_0000` is created by AutoML when it detects imbalance and applies class weighting/sampling; in some ML runtime versions, a bug can make AutoML reference that column before it's properly materialized, causing the "cannot be resolved" error.

This shows up more often when AutoML needs to sample due to memory constraints (wide/high‑dimensional tables or insufficient per‑core memory on the worker/driver). AutoML’s sampling behavior depends strongly on memory per core, and datasets are sampled when the estimated memory exceeds available resources.
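
As a rough back-of-the-envelope check (my own estimate, not a documented AutoML formula), a 50,000 × 6,000 numeric table is already a couple of gigabytes for a single dense in-memory copy, before any pandas or training overhead:

```python
# Rough size estimate for one dense in-memory copy of the training table.
rows, cols = 50_000, 6_000
bytes_per_value = 8  # assumes float64; strings/categoricals are usually larger

approx_gb = rows * cols * bytes_per_value / 1024**3
print(f"~{approx_gb:.1f} GB")  # ~2.2 GB
```

So on a cluster with only a few GB of memory per core, sampling can kick in quickly for a table this wide.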

My main suggestion would be to reduce the number of columns you pass to AutoML from 6,000 to something significantly smaller. There are likely a few thousand columns that add nothing to the model, and preprocessing the dataset a little before handing it to AutoML will significantly improve the chances of AutoML succeeding.

Removing low-variance features and highly correlated features would be a good start.
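
A minimal sketch of that kind of pruning, assuming the table fits in pandas and using placeholder table/column names (`main.default.wide_training_table`, `label`):

```python
import numpy as np
from databricks import automl

# Placeholder names -- substitute your own table and target column.
pdf = spark.table("main.default.wide_training_table").toPandas()
target_col = "label"

numeric = pdf.drop(columns=[target_col]).select_dtypes(include="number")

# 1) Drop near-constant (low-variance) columns.
low_var = [c for c in numeric.columns if numeric[c].var() < 1e-6]

# 2) Drop one column from each highly correlated pair (|corr| > 0.95).
corr = numeric.drop(columns=low_var).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [c for c in upper.columns if (upper[c] > 0.95).any()]

keep = [c for c in pdf.columns if c not in set(low_var) | set(high_corr)]
pruned = pdf[keep]
print(f"Kept {len(keep)} of {pdf.shape[1]} columns")

# Hand the pruned table to AutoML (classification shown).
summary = automl.classify(dataset=pruned, target_col=target_col, timeout_minutes=60)
```

Whether you prune in pandas or directly in Spark is a detail; the point is to get the column count down before AutoML has to estimate memory and decide whether to sample.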

Alternatively (and perhaps in addition to pruning the feature set), you can use clusters with significantly more memory per core - do you happen to know what your current configuration is?


2 REPLIES

dkxxx-rc
Contributor

I have been using all my own model construction lately rather than AutoML, so I won't have any new experiences or attempts to report in this thread. However, your insight into what's happening under the hood is valuable and enlightening, and it will likely do me some good in the long run. Thanks!