I am developing a LightGBM model on Databricks, and I am using the Native API because it offers the widest range of options and lets me experiment with different approaches.
The training data is loaded from a table in the Catalog as a Spark DataFrame. However, my understanding is that the Native API requires the data to be converted into a pandas DataFrame (or NumPy array) first.
Due to project constraints, the available memory is limited to 32 GB and I cannot use AutoML. When the training data is large, this pandas conversion can cause an out-of-memory error.
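One mitigation I am aware of (my suggestion, not something already tried in the question): if the conversion completes but the resulting frame is too large, downcasting float64 columns to float32 roughly halves the numeric footprint, and LightGBM trains fine on float32. A minimal pandas-only sketch with a synthetic frame standing in for the output of `spark_df.toPandas()`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a frame produced by spark_df.toPandas().
pdf = pd.DataFrame({
    "feature_1": np.arange(1_000, dtype="float64"),
    "feature_2": np.arange(1_000, dtype="float64"),
})

before = pdf.memory_usage(deep=True).sum()

# Downcast every float64 column to float32.
for col in pdf.select_dtypes(include="float64").columns:
    pdf[col] = pdf[col].astype("float32")

after = pdf.memory_usage(deep=True).sum()
```

Separately, enabling Arrow-based conversion (`spark.sql.execution.arrow.pyspark.enabled`, on by default in recent Databricks runtimes) makes `toPandas()` faster, though it does not remove the requirement that the full frame fit in driver memory.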
For example, I tried splitting the data into column-wise chunks and converting each chunk to pandas, as suggested by Genie, but the chunks ended up misaligned along the row dimension.
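If a chunked conversion is attempted, each chunk needs a shared row key so the pieces can be realigned after collection, because separate `toPandas()` calls do not guarantee the same row order. In Spark, one way to create such a key is `pyspark.sql.functions.monotonically_increasing_id()` (my suggestion, not part of the original attempt). A pandas-only sketch of the realignment step, with hypothetical column names:

```python
import pandas as pd

# Simulated column chunks that arrived in different row orders,
# as can happen when each chunk is collected by a separate toPandas() call.
chunk_a = pd.DataFrame({"row_id": [0, 1, 2], "feature_1": [10, 11, 12]})
chunk_b = pd.DataFrame({"row_id": [2, 0, 1], "feature_2": [22, 20, 21]})

# Merging on the shared key realigns rows regardless of arrival order.
aligned = (
    chunk_a.merge(chunk_b, on="row_id")
    .sort_values("row_id")
    .reset_index(drop=True)
)
```

Note that even with correct alignment, all chunks still end up in driver memory at once, so this only helps with conversion-time overhead, not with the final footprint.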
Because the project schedule is extremely tight and I do not have much time to investigate this deeply, I would appreciate any good solutions or best practices.
By the way, I have heard that there is a Spark-native LightGBM (SynapseML, I believe). If I use that, is it possible to train a model without converting the data to pandas?