Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Memory error in LightGBM training data processing

tkfm_s
Visitor

I am developing a LightGBM model on Databricks, and I am using the Native API because it offers the widest range of options and allows me to try various approaches.

The training data is loaded from a table in the Catalog as a Spark DataFrame. However, my understanding is that when using the Native API, the data needs to be converted into a pandas DataFrame.

Due to project constraints, the available memory is limited to 32 GB and I cannot use AutoML. When the training data is large, this pandas conversion can result in an out-of-memory error.

For example, I tried splitting the data into column-wise chunks and converting each chunk to pandas, as suggested by code from Genie, but this caused misalignment along the row dimension.

Because the project schedule is extremely tight and I do not have much time to investigate this deeply, I would appreciate any good solutions or best practices.

By the way, I have heard that there is Spark LightGBM. If I use that, is it possible to train a model without converting the data to pandas?

1 REPLY

JAHNAVI
Databricks Employee

@tkfm_s
Yes, SynapseML's LightGBMClassifier / LightGBMRegressor lets you train directly on a Spark DataFrame, with no pandas conversion required. Also make sure the number of partitions matches the total executor cores so LightGBM uses all of them. And if you have a very wide set of columns, it is advised to reduce them (e.g., via feature selection) to avoid OOM.
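A minimal sketch of what that could look like, assuming a cluster with the SynapseML library installed; the table name, label column, and cluster shape below are placeholders, not part of the original question:

```python
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

# Load the training data straight from the Catalog as a Spark DataFrame.
df = spark.table("catalog.schema.training_data")  # placeholder table name

# Match partition count to total executor cores so LightGBM uses them all.
executors, cores_per_executor = 4, 8              # assumed cluster shape
df = df.repartition(executors * cores_per_executor)

# SynapseML expects features assembled into a single vector column.
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(df)

# Distributed training directly on the Spark DataFrame; no toPandas() call,
# so the driver never has to hold the full dataset in memory.
model = LightGBMRegressor(
    labelCol="label",
    featuresCol="features",
    numIterations=200,
).fit(train)
```

The same pattern applies to LightGBMClassifier for classification tasks.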

Attaching the document for lightgbm distributed training: 
https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html

Jahnavi N