10-11-2024 04:07 AM
I'm looking to scale xgboost to large datasets that won't fit in memory on a single large EC2 instance (billions to tens of billions of rows). I also require many of the bells & whistles of regular in-memory xgboost / lightgbm, including:
I've tried a number of approaches, most of which have not worked and none of which have been fully satisfactory. I'll provide details below, but my high-level question is "is there a best-practice way of doing this (on Databricks)?".
I'm certainly not an expert on GPU training, but my intuition is that xgboost isn't really a batch-based algorithm like SGD for neural networks, so we're always going to be memory-constrained and GPU-based approaches are not the way forward here, but happy to be told otherwise.
What have I tried?
Spark.ml's GBTClassifier:
Pros:
Cons:
XGBoost's SparkXGBClassifier
Pros:
Cons:
SynapseML's LightGBMClassifier
Pros:
Cons:
Ray XGBoost
Pros:
Cons:
Dask:
Dask XGBoost (running on dask-databricks):
Pros:
Cons:
Dask LightGBM
Pros:
Cons:
So in conclusion, I've only managed to get xgboost.dask running on dask-databricks to scale to a dataset meaningfully larger than anything I can fit in memory (I can scale to ~100 million rows in memory on a large EC2 instance reasonably comfortably in both xgboost and lgbm), and it's far from an optimal solution, so I'm still on the hunt for the best way to do this.
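For reference, a minimal sketch of the xgboost.dask setup I mean (paths, column names and parameters are placeholders, and I'm assuming a dask-databricks client):

```python
# Minimal sketch of distributed training with xgboost.dask on dask-databricks.
# Paths, column names and params are placeholders.
import dask.dataframe as dd
import xgboost as xgb
import dask_databricks  # assumed installed on the cluster

client = dask_databricks.get_client()  # Dask client backed by the Databricks cluster

df = dd.read_parquet("/dbfs/path/to/training_data/")  # hypothetical path
y = df["label"]
X = df.drop(columns=["label"])

# DaskDMatrix keeps the data partitioned across the workers instead of
# collecting it onto the driver.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist", "max_depth": 8},
    dtrain,
    num_boost_round=200,
)
booster = output["booster"]  # plain xgboost.Booster, usable for single-node predict
```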
11-17-2024 02:39 PM
Very insightful writeup, thanks. I wish somebody experienced in large-scale xgboost / lightgbm usage would share more. I've encountered a similar problem.
11-23-2024 07:28 AM
Facing the same exact issue
06-03-2025 09:39 AM
@xgbeast Do you have any new learnings or updates from this thread?
06-03-2025 02:32 PM
Unfortunately not, I've not made any progress since the original post. The best solution I've got is the dask one I outlined, but it's far from ideal.
08-04-2025 03:50 AM
Hello,
I saw this post earlier this year as I was stuck on something similar. I have recently managed to train XGBoost models with approximately 120 features and up to 1,200,000,000 rows using GPUs (it took around 7 minutes with 50 boosting rounds, using 6 H100 GPUs).
I have been in contact with Databricks and I was told that Ray is the way to go, so I have been using that together with GPUs. Even though I have so far "only" worked with a maximum of 1,200,000,000 rows, I would assume that the approach for billions and tens of billions of data points would be the same.
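To give an idea of the shape of this, here is a minimal sketch along the lines of what I mean, using Ray's XGBoostTrainer (dataset path, column names and parameters are placeholders, and the exact API differs between Ray versions; newer releases favour the training-function pattern described in the Ray docs):

```python
# Minimal sketch of distributed GPU training with Ray's XGBoostTrainer.
# Dataset path, column names and parameters are placeholders.
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

train_ds = ray.data.read_parquet("s3://my-bucket/training_data/")  # hypothetical path

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=6, use_gpu=True),  # one worker per GPU
    label_column="label",
    params={
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda",  # on XGBoost < 2.0 use "tree_method": "gpu_hist" instead
    },
    datasets={"train": train_ds},
    num_boost_round=50,
)
result = trainer.fit()
```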
I also created an issue in XGBoost's GitHub repo [1], where I asked about training XGBoost on large datasets, and I got some valuable information there. It also contains the code I used for hyperparameter search, but it can probably be made more efficient for larger datasets by using Ray data sharding instead of the materialization.
Here is what I have learned that could potentially be of help:
GPUs can be used with XGBoost.
Ray's documentation for XGBoost is a good starting point [2]. Key takeaways for large datasets:
XGBoost has something called QuantileDMatrix, which was "primarily designed to reduce the required GPU memory for training on distributed environment" [4]. Use this with a custom iterator, instead of a regular XGBoost DMatrix.
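To make that last point concrete, here is a minimal sketch of feeding a QuantileDMatrix through a custom DataIter so the data is consumed batch by batch instead of being materialized all at once (the iterator class, file list, column names and parameters are placeholders):

```python
# Minimal sketch: stream parquet files into a QuantileDMatrix via a custom
# iterator instead of building a full DMatrix in memory. Paths, column names
# and params are placeholders.
import pandas as pd
import xgboost as xgb

class ParquetIterator(xgb.DataIter):
    """Yields one parquet file per batch to XGBoost."""

    def __init__(self, file_paths, label_col):
        self._files = file_paths
        self._label_col = label_col
        self._it = 0
        super().__init__()

    def next(self, input_data):
        # Return 0 when exhausted, 1 otherwise; pass one batch per call.
        if self._it == len(self._files):
            return 0
        df = pd.read_parquet(self._files[self._it])
        input_data(data=df.drop(columns=[self._label_col]),
                   label=df[self._label_col])
        self._it += 1
        return 1

    def reset(self):
        self._it = 0

it = ParquetIterator(["part-000.parquet", "part-001.parquet"], label_col="label")

# QuantileDMatrix builds the quantised representation directly from the
# iterator, which is much lighter on (GPU) memory than a regular DMatrix.
dtrain = xgb.QuantileDMatrix(it)

booster = xgb.train(
    {"objective": "binary:logistic", "tree_method": "hist", "device": "cuda"},
    dtrain,
    num_boost_round=50,
)
```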
[1] https://github.com/dmlc/xgboost/issues/11592
[3] https://docs.ray.io/en/latest/train/getting-started-xgboost.html#set-up-a-training-function