Memory error in LightGBM training data processing

tkfm_s
New Contributor II

I am developing a LightGBM model on Databricks, and I am using the Native API because it offers the widest range of options and allows me to try various approaches.

The training data is loaded from a table in the Catalog as a Spark DataFrame. However, my understanding is that when using the Native API, the data needs to be converted into a pandas DataFrame.

Due to project constraints, the available memory is limited to 32 GB and I cannot use AutoML. When the training data is large, the conversion to pandas fails with an out-of-memory error.
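To put rough numbers on the memory limit, here is a back-of-the-envelope sketch. The row and column counts are made-up placeholders, not my actual table, and the estimate covers only the final dense array; the conversion itself may temporarily need more than this because of intermediate copies.

```python
def dense_size_gib(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Size in GiB of a dense n_rows x n_cols block of fixed-width values."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Hypothetical table shape, chosen only for illustration.
N_ROWS = 50_000_000
N_COLS = 100

# Spark double columns arrive in pandas as float64 (8 bytes per value).
as_float64 = dense_size_gib(N_ROWS, N_COLS, 8)
# Downcasting to float32 (4 bytes per value) halves the footprint.
as_float32 = dense_size_gib(N_ROWS, N_COLS, 4)

print(f"float64: {as_float64:.1f} GiB")
print(f"float32: {as_float32:.1f} GiB")
```

With these placeholder numbers, the float64 version alone would already exceed 32 GB, while the float32 version would fit, at least on paper.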

As a workaround, I tried splitting the data into column-wise chunks and converting each chunk to pandas separately, as suggested by Genie code, but the rows no longer lined up across the chunks.
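My guess at the cause is that Spark does not guarantee row order across separate collects, so stitching the chunks together by position mixes up rows. A minimal sketch of what I think the fix would look like, joining the chunks on a shared unique key instead of by position, is below. The helper `assemble_column_chunks` and the `row_id` column are hypothetical names of my own, and the sketch assumes every feature column is numeric.

```python
import pandas as pd


def assemble_column_chunks(chunks, key):
    """Combine column-wise chunks by a shared unique key, not by row position."""
    aligned = []
    for chunk in chunks:
        if chunk[key].duplicated().any():
            raise ValueError(f"{key!r} must be unique within each chunk")
        # Index by the key, sort, and downcast features to float32
        # (assumption: all feature columns are numeric).
        part = chunk.set_index(key).sort_index().astype("float32")
        if aligned and not part.index.equals(aligned[0].index):
            raise ValueError("chunks do not contain the same set of keys")
        aligned.append(part)
    # axis=1 concatenation aligns on the index, which is now the key.
    return pd.concat(aligned, axis=1)


# Two toy chunks whose rows arrive in different orders.
chunk_a = pd.DataFrame({"row_id": [2, 0, 1], "f1": [20.0, 0.0, 10.0]})
chunk_b = pd.DataFrame({"row_id": [1, 2, 0], "f2": [1.5, 2.5, 0.5]})

features = assemble_column_chunks([chunk_a, chunk_b], key="row_id")
print(features)
```

In the real job, each chunk would presumably come from something like `sdf.select("row_id", *cols).toPandas()`, which requires the table to already contain a stable unique key column.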

Because the project schedule is extremely tight and I do not have much time to investigate this deeply, I would appreciate any good solutions or best practices.

I have also heard that there is a Spark LightGBM. If I use that, is it possible to train a model without converting the data to pandas?