Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Why does my MLflow model training job fail on Databricks with an out-of-memory error for large datasets?

Suheb
Contributor

I am trying to train a machine learning model using MLflow on Databricks. When my dataset is very large, the training job stops with an 'out-of-memory' error. Why does this happen, and how can I fix it?

1 REPLY

mukul1409
New Contributor

Hi @Suheb 

This happens because, during training, the entire dataset or large intermediate objects are loaded into driver or executor memory and exceed what the cluster has available. It is especially common when you work with very large DataFrames, collect data to the driver, or use algorithms that are not fully distributed. MLflow itself does not manage memory; it only tracks experiments, so the out-of-memory error comes from Spark or the underlying ML library.

To fix it:

- Avoid collect() and toPandas() on large datasets.
- Prefer distributed Spark ML algorithms over single-node libraries where possible (see the first sketch below).
- Increase cluster memory or add more executors.
- Cache only what is necessary.
- Consider sampling or incremental training for very large datasets (see the second sketch below).

Databricks also recommends monitoring memory usage in the Spark UI and following its best practices for large-scale machine learning and memory management, as described in the Databricks ML and Spark optimization documentation.
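As a rough illustration of the distributed approach, here is a minimal sketch that keeps the data as a Spark DataFrame and trains a Spark ML pipeline while logging to MLflow. The table name and feature/label column names are placeholders, and it assumes you are in a Databricks notebook where `spark` is already defined:

```python
# Sketch: keep the data distributed and train with Spark ML instead of
# pulling everything to the driver with collect() or toPandas().
# "my_catalog.my_schema.training_data", "label", and the feature column
# names below are placeholders for illustration.
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

df = spark.table("my_catalog.my_schema.training_data")  # stays distributed across executors

assembler = VectorAssembler(
    inputCols=["feature_1", "feature_2", "feature_3"],   # placeholder feature columns
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

with mlflow.start_run():
    model = pipeline.fit(df)                  # training runs on the executors, not the driver
    mlflow.spark.log_model(model, "model")    # MLflow only tracks/stores the artifact
```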
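If you do need a single-node library such as scikit-learn, one option is to down-sample before converting to pandas so only a driver-sized slice leaves Spark. Again a sketch with placeholder names; the 10% fraction is an assumption you would size to your driver memory:

```python
# Sketch: sample before toPandas() so the driver only holds a manageable subset.
import mlflow
from sklearn.linear_model import LogisticRegression

sample_pdf = (
    spark.table("my_catalog.my_schema.training_data")   # placeholder table
         .sample(fraction=0.10, seed=42)                 # down-sample before leaving Spark
         .toPandas()
)

X = sample_pdf[["feature_1", "feature_2", "feature_3"]]  # placeholder features
y = sample_pdf["label"]

with mlflow.start_run():
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.sklearn.log_model(clf, "model")
    mlflow.log_param("sample_fraction", 0.10)            # record how the data was reduced
```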

Mukul Chauhan