Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Memory error in LightGBM training data processing

tkfm_s
Visitor

I am developing a LightGBM model on Databricks, and I am using the Native API because it offers the widest range of options and allows me to try various approaches.

The training data is loaded from a table in the Catalog as a Spark DataFrame. However, my understanding is that when using the Native API, the data needs to be converted into a pandas DataFrame.

Due to project constraints, the available memory is limited to 32 GB and I cannot use AutoML. When the training data is large, this pandas conversion can result in an out-of-memory error.

For example, I tried splitting the data into column-wise chunks and converting each chunk to pandas, as suggested by code from Genie, but this caused misalignment along the row dimension.

Because the project schedule is extremely tight and I do not have much time to investigate this deeply, I would appreciate any good solutions or best practices.

By the way, I have heard that there is Spark LightGBM. If I use that, is it possible to train a model without converting the data to pandas?

1 REPLY

JAHNAVI
Databricks Employee

@tkfm_s
Yes, SynapseML's LightGBMClassifier / LightGBMRegressor lets you train directly on a Spark DataFrame, with no pandas conversion required. Also make sure the number of partitions matches the total executor cores so LightGBM uses all of them. And if you have a very wide set of columns, it is advised to reduce them (e.g., via feature selection) to avoid OOM.
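A minimal sketch of what that could look like, assuming a cluster with the SynapseML library installed; the table name, label column, and cluster shape below are placeholders, not part of the original question:

```python
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

# Load the training data straight from the Catalog as a Spark DataFrame.
df = spark.table("catalog.schema.training_data")  # placeholder table name

# Match partition count to total executor cores so LightGBM uses them all.
executors, cores_per_executor = 4, 8              # assumed cluster shape
df = df.repartition(executors * cores_per_executor)

# SynapseML expects features assembled into a single vector column.
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(df)

# Distributed training directly on the Spark DataFrame; no toPandas() call,
# so the driver never has to hold the full dataset in memory.
model = LightGBMRegressor(
    labelCol="label",
    featuresCol="features",
    numIterations=200,
).fit(train)
```

The same pattern applies to LightGBMClassifier for classification tasks.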

Attaching the document for lightgbm distributed training: 
https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html

Jahnavi N