IvanK
New Contributor III

Hello,

I saw this post earlier this year as I was stuck on something similar. I have recently managed to train XGBoost models with approximately 120 features and up to 1 200 000 000 rows using GPUs (training took around 7 minutes with 50 boosting rounds on 6 H100 GPUs).

I have been in contact with Databricks and was told that Ray is the way to go, so I have been using it together with GPUs. Even though I have so far "only" worked with a maximum of 1 200 000 000 rows of data, I would assume that the approach for billions and tens of billions of data points would be the same.

I also created an issue in XGBoost's GitHub repo [1], where I asked about training XGBoost on large datasets, and I got some valuable information there. It also contains the code I used for hyperparameter search, though it can probably be made more efficient for larger datasets by using Ray Data sharding instead of materialization.

Here is what I have learned that could potentially be of help:

GPUs can be used with XGBoost.

Ray's documentation for XGBoost is a good starting point [2]. Key takeaways for large datasets:

  1. Do not use Pandas DataFrames
  2. Do not materialize the Ray dataset
  3. Create a training function that runs xgboost.train, as described here [3]
  4. In this training function, use the ray.train.get_dataset_shard function
    1. To my understanding, this ensures each worker gets only a piece of the data, reducing the memory required per GPU
  5. Use several workers to distribute the data between them (1 worker per GPU)
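To make the points above concrete, here is a rough sketch of what my setup looked like, based on the Ray Train XGBoost docs [2][3]. The dataset path, column names, and worker count are placeholders, not my actual values; adapt them to your cluster:

```python
# Sketch: distributed GPU training with Ray Train + XGBoost.
# Assumes a Ray cluster with GPU workers; paths/columns are placeholders.
import ray
import xgboost
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

# Points 1/2: read as a Ray Dataset; do not call .to_pandas() or
# .materialize() on the full dataset at driver level.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path

def train_fn_per_worker(config: dict):
    # Point 4: each worker pulls only its own shard of the data
    shard = ray.train.get_dataset_shard("train")
    # Materializing just the per-worker shard is fine if it fits in memory;
    # for very large data, stream batches into a QuantileDMatrix instead
    df = shard.materialize().to_pandas()
    dtrain = xgboost.QuantileDMatrix(
        df.drop(columns=["label"]), label=df["label"]  # placeholder label column
    )
    # Point 3: plain xgboost.train inside the per-worker function
    xgboost.train(
        {"objective": "binary:logistic", "device": "cuda", "tree_method": "hist"},
        dtrain,
        num_boost_round=50,
        callbacks=[RayTrainReportCallback()],
    )

# Point 5: one worker per GPU
trainer = XGBoostTrainer(
    train_fn_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=6, use_gpu=True),
)
result = trainer.fit()
```

This needs an actual Ray cluster with GPUs to run, so treat it as a starting template rather than something to copy verbatim.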

XGBoost has something called QuantileDMatrix, which was "primarily designed to reduce the required GPU memory for training on distributed environment" [4]. Use it with a custom iterator instead of a regular XGBoost DMatrix.
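As a rough illustration of the custom-iterator idea (modeled on the XGBoost demo in [4]), a subclass of xgboost.DataIter can feed data to QuantileDMatrix one chunk at a time, so the whole dataset never has to sit in GPU memory at once. File names and the label column here are made up for the example:

```python
# Sketch: streaming data into a QuantileDMatrix via a custom iterator.
# Requires xgboost with GPU support; file names are placeholders.
import pandas as pd
import xgboost

class ParquetIter(xgboost.DataIter):
    """Feed parquet files to QuantileDMatrix one file at a time."""

    def __init__(self, files):
        self._files = files
        self._it = 0
        super().__init__()

    def next(self, input_data):
        # DataIter contract: return 1 while there is data, 0 when exhausted
        if self._it == len(self._files):
            return 0
        df = pd.read_parquet(self._files[self._it])
        input_data(data=df.drop(columns=["label"]), label=df["label"])
        self._it += 1
        return 1

    def reset(self):
        # Called by XGBoost before each re-pass over the data
        self._it = 0

it = ParquetIter(["part-0.parquet", "part-1.parquet"])  # placeholder files
dtrain = xgboost.QuantileDMatrix(it)
booster = xgboost.train(
    {"device": "cuda", "tree_method": "hist"}, dtrain, num_boost_round=50
)
```

The key point is that QuantileDMatrix only keeps the quantized representation, so combined with the iterator the peak memory per GPU stays bounded by the chunk size rather than the full dataset.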

 

[1] https://github.com/dmlc/xgboost/issues/11592

[2] https://docs.ray.io/en/latest/train/getting-started-xgboost.html#get-started-with-distributed-traini...

[3] https://docs.ray.io/en/latest/train/getting-started-xgboost.html#set-up-a-training-function 

[4] https://xgboost.readthedocs.io/en/stable/python/examples/quantile_data_iterator.html#demo-for-using-...