Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Addressing Memory Constraints in Scaling XGBoost and LGBM: A Comprehensive Approach for High-Volume

fiverrpromotion
New Contributor

Scaling XGBoost and LightGBM models to exceptionally large datasets (billions to tens of billions of rows) presents a formidable computational challenge, particularly when training is confined to in-memory processing on a single, albeit large, EC2 instance. At that scale, the conventional in-memory training paradigm of these gradient-boosted decision tree algorithms breaks down, necessitating a shift toward distributed environments such as Databricks.
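
To make the distributed baseline concrete, here is the kind of setup I have in mind, using the PySpark estimator bundled with recent versions of xgboost (1.7+). This is only a minimal sketch: the DataFrames, column names, and worker count are illustrative placeholders rather than my actual configuration.

```python
# Minimal sketch: distributed XGBoost training on Databricks via xgboost's
# PySpark estimator (xgboost >= 1.7). `train_df` and `test_df` are assumed to be
# Spark DataFrames with a vector column "features" and a numeric "label" column.
from xgboost.spark import SparkXGBClassifier

xgb_estimator = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=8,            # one distributed training task per worker
    tree_method="hist",       # histogram-based split finding keeps memory bounded
    max_depth=8,
    n_estimators=500,
)

model = xgb_estimator.fit(train_df)        # training is distributed across the cluster
predictions = model.transform(test_df)     # returns a DataFrame with prediction columns
```

As far as I know, LightGBM has an analogous Spark estimator in the SynapseML library, which can be installed on a Databricks cluster as a Maven package.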

Beyond standard scaling techniques, several algorithmic features integral to the performance of these models must be preserved during large-scale training: early stopping during iterative training, efficient handling of sparse feature representations and native support for high-cardinality categorical variables (without resorting to infeasible one-hot encoding), and monotonicity constraints on selected features, which are critical for interpretability and performance in some domains.
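
To illustrate exactly which switches I mean, here is a minimal single-node LightGBM sketch on synthetic data; the feature names, constraint vector, and objective are illustrative only.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Illustrative synthetic data: one numeric feature, one high-cardinality categorical feature.
rng = np.random.default_rng(0)
n = 20_000
X = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, n),
    "store_id": pd.Categorical(rng.integers(0, 5_000, n)),
})
y = (X["price"].to_numpy() + rng.normal(0, 10, n) > 50).astype(int)

train_set = lgb.Dataset(
    X.iloc[:15_000], label=y[:15_000],
    categorical_feature=["store_id"],          # handled natively, no one-hot encoding needed
)
valid_set = lgb.Dataset(X.iloc[15_000:], label=y[15_000:], reference=train_set)

params = {
    "objective": "binary",
    "learning_rate": 0.05,
    # One entry per feature: +1 non-decreasing, -1 non-increasing, 0 unconstrained.
    "monotone_constraints": [1, 0],
}

booster = lgb.train(
    params,
    train_set,
    num_boost_round=2_000,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop once the validation metric stalls
)
```

The open question for me is whether these options survive intact once training is distributed across a cluster.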

A further challenge is the ability to adjust the learning rate dynamically mid-training, which matters particularly at this scale. It would allow training to continue after an initial run has converged with a small learning rate; a small rate yields precise incremental improvements but converges slowly and generates an unnecessarily large number of trees, leaving the model overly complex and slow at inference time.
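
For reference, LightGBM exposes two mechanisms that roughly correspond to what I am after, sketched below reusing the Dataset and params from the previous snippet; the schedule and round counts are arbitrary.

```python
import lightgbm as lgb

# Option 1: schedule the learning rate per boosting round via a callback
# (the lambda receives the round index and returns the rate to use).
decay = lgb.reset_parameter(learning_rate=lambda i: max(0.01, 0.1 * (0.99 ** i)))
booster = lgb.train(params, train_set, num_boost_round=1_000, callbacks=[decay])

# Option 2: resume an already-trained model with a smaller learning rate,
# adding more trees instead of retraining from scratch.
finetune_params = dict(params, learning_rate=0.01)
booster = lgb.train(
    finetune_params,
    train_set,
    num_boost_round=500,
    init_model=booster,   # continue boosting from the existing trees
)
```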

My attempts so far, spanning various approaches, have proven unsatisfactory, largely because XGBoost's algorithm is memory-bound and less amenable to the batch-based optimizations that stochastic gradient descent (SGD) affords neural networks. GPU-based training has shown promise at smaller scales, but its applicability to datasets of this magnitude remains doubtful, as XGBoost's underlying structure is inherently memory-constrained.
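
For completeness, one commonly suggested mitigation is XGBoost's external-memory mode, which streams batches from disk through a DataIter instead of materializing everything in RAM (available in xgboost 1.5+). The sketch below shows the shape of it; the shard paths and cache location are hypothetical placeholders, and it is unclear to me how well this scales to tens of billions of rows.

```python
import pandas as pd
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Streams one Parquet shard at a time so the full dataset never sits in RAM."""

    def __init__(self, shard_paths):
        self._paths = shard_paths
        self._pos = 0
        # Intermediate pages are cached on local disk under this prefix (path is illustrative).
        super().__init__(cache_prefix="/local_disk0/xgb_cache")

    def next(self, input_data):
        if self._pos == len(self._paths):
            return False                      # no more batches
        df = pd.read_parquet(self._paths[self._pos])
        input_data(data=df.drop(columns=["label"]), label=df["label"])
        self._pos += 1
        return True

    def reset(self):
        self._pos = 0

# Shard paths are hypothetical placeholders.
dtrain = xgb.DMatrix(BatchIter(["/dbfs/tmp/shards/part-0.parquet",
                                "/dbfs/tmp/shards/part-1.parquet"]))
booster = xgb.train({"tree_method": "hist", "max_depth": 8}, dtrain, num_boost_round=500)
```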

Given these complexities, I seek an optimal, best-practice solution for distributed XGBoost/LightGBM training in environments such as Databricks, where resource constraints and algorithmic requirements must be delicately balanced.

1 REPLY

earntodiessaz
New Contributor II

Well, that's a superb article! Thank you for this great information, you write very well which I like very much. I am really impressed by your post. run 3
