Hi @Suheb
This usually happens because, during training, the entire dataset or large intermediate objects get loaded into driver or executor memory, which can exceed what the cluster has available. It's especially common with large DataFrames, when collecting data to the driver, or when using algorithms that are not fully distributed. Note that MLflow itself does not manage memory; it only tracks experiments, so the out-of-memory error is coming from Spark or the underlying ML library, not from MLflow.

A few things that typically fix or avoid it:
- Avoid collect() or toPandas() on large datasets.
- Prefer distributed Spark ML algorithms over single-node libraries where possible (see the sketch below).
- Increase cluster memory or add more executors.
- Cache only what is necessary.
- Consider sampling or incremental training for very large datasets.

Databricks also recommends monitoring memory usage in the Spark UI and following its best practices for large-scale machine learning and memory management, as described in the Databricks ML and Spark optimization documentation.
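For illustration, here is a minimal sketch of the distributed approach. The file path, column names, and metric are placeholders (not from your setup), and it assumes a Databricks notebook where the `spark` session is already available:

```python
import mlflow
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Keep the data as a distributed Spark DataFrame -- no collect() or
# toPandas(), so the driver never has to hold the full dataset.
df = spark.read.parquet("/mnt/data/training_data")  # placeholder path

# Placeholder feature/label column names.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(df).select("features", "label")

# Optional: sample while iterating to keep memory pressure low.
# train_df = train_df.sample(fraction=0.1, seed=42)

with mlflow.start_run():
    # Spark ML's LinearRegression trains across the executors instead of
    # pulling everything onto a single node.
    model = LinearRegression(featuresCol="features", labelCol="label").fit(train_df)
    mlflow.log_metric("rmse", model.summary.rootMeanSquaredError)
```

The key point is that the training data stays distributed end to end; MLflow only records the run, so the memory footprint is determined by how Spark (or your ML library) handles the data.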
Mukul Chauhan