Scaling XGBoost and LightGBM models to handle exceptionally large datasets—those comprising billions to tens of billions of rows—presents a formidable computational challenge, particularly when constrained by the limitations of in-memory processing on a single, albeit large, EC2 instance. The conventional, in-memory training paradigm of these gradient-boosted decision tree algorithms becomes inefficient, necessitating a shift in strategy for distributed environments, such as Databricks.
In addition to standard scaling techniques, there are specific algorithmic features integral to the performance and optimization of these models that must be preserved during large-scale training. These include the capability for early stopping during iterative training, the ability to efficiently manage sparse feature representations or handle high-cardinality categorical variables natively without resorting to infeasible one-hot encoding approaches, and the imposition of monotonicity constraints on certain features, which is critical for maintaining interpretability and performance in certain domains.
A further challenge is the ability to dynamically adjust the learning rate mid-training, particularly for datasets of this scale. This allows for the continuation of training after an initial convergence with a small learning rate, which, while it ensures precise improvement, can lead to slow convergence rates, affecting model performance by generating an unnecessarily large number of trees. Such a situation leads to a model with excessive complexity and slower inference times.
Current attempts, spanning various approaches, have proven unsatisfactory, often constrained by the memory-bound nature of XGBoost's algorithm, which is less amenable to batch-based optimizations, such as those employed by stochastic gradient descent (SGD) in neural networks. While GPU-based training has shown promise in smaller contexts, its applicability to the massive-scale datasets I am working with remains dubious, as XGBoost's underlying structure is inherently constrained by memory limitations.
Given these complexities, I seek an optimal, best-practice solution for distributed XGBoost/LightGBM training in environments such as Databricks, where resource constraints and algorithmic requirements must be delicately balanced.