Why does my MLflow model training job fail on Databricks with an out‑of‑memory error for large datas

Suheb — Wed, 31 Dec 2025 06:52:17 GMT

I am trying to train a machine learning model using MLflow on Databricks. When my dataset is very large, the training stops and gives an ‘out-of-memory’ error. Why does this happen and how can I fix it?

Re: Why does my MLflow model training job fail on Databricks with an out‑of‑memory error for large d

mukul1409 — Wed, 31 Dec 2025 08:58:53 GMT

Hi @Suheb

This happens because the entire dataset, or large intermediate objects, are loaded into driver or executor memory during training, and that can exceed the memory available on the cluster. Typical causes are very large DataFrames, collecting data to the driver, and algorithms that are not fully distributed. MLflow itself does not manage memory; it only tracks experiments, so the out-of-memory error comes from Spark or the underlying ML library.

To fix this, avoid calling collect or toPandas on large datasets, use distributed Spark ML algorithms instead of single-node libraries where possible, increase cluster memory or add executors, cache only what is necessary, and consider sampling or incremental training for very large datasets. Databricks also recommends monitoring memory usage in the Spark UI and following the best practices for large-scale machine learning and memory management described in the Databricks ML and Spark optimization documentation.
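As a rough illustration of the incremental training idea, here is a minimal sketch in plain Python. It is not MLflow or Spark API: `stream_batches` and `train_incrementally` are hypothetical names, the data is synthetic, and the model is a one-feature linear regression chosen only to keep the example short. The point is that only one small batch is held in memory at a time, so peak memory depends on the batch size rather than on the dataset size.

```python
import random
from typing import Iterator, List, Tuple

Batch = List[Tuple[float, float]]


def stream_batches(n_rows: int, batch_size: int, seed: int = 0) -> Iterator[Batch]:
    """Yield small batches of (x, y) rows.

    Hypothetical stand-in for reading a large dataset chunk by chunk;
    the full dataset is never materialized as one in-memory object.
    """
    rng = random.Random(seed)
    batch: Batch = []
    for _ in range(n_rows):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0  # synthetic target: y = 2x + 1
        batch.append((x, y))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch, if any
        yield batch


def train_incrementally(batches: Iterator[Batch], lr: float = 0.1) -> Tuple[float, float]:
    """Fit y = w*x + b with one mini-batch gradient step per batch."""
    w, b = 0.0, 0.0
    for batch in batches:
        grad_w = 0.0
        grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        n = len(batch)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Only one batch of 100 rows is held in memory at any time.
w, b = train_incrementally(stream_batches(n_rows=100_000, batch_size=100))
```

In a real job the generator would be replaced by whatever chunked reader fits your data source, and the update step by a library that supports incremental fitting. Whether that is preferable to a distributed Spark ML algorithm depends on the model and the cluster.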

Re: Why does my MLflow model training job fail on Databricks with an out‑of‑memory error for large d

iyashk-DB — Mon, 05 Jan 2026 15:45:35 GMT

+1 to what @mukul1409 said. Please follow the guides below to distribute the training:

https://docs.databricks.com/aws/en/machine-learning/train-model/distributed-training/spark-pytorch-d...

https://docs.databricks.com/aws/en/notebooks/source/deep-learning/torch-distributor-lightning.html