jon-cheung
Databricks Employee

Problem Statement

Technologies used: Ray, GPUs, Unity Catalog, MLflow, XGBoost

For many data scientists, eXtreme Gradient Boosting (XGBoost) remains a popular algorithm for tackling regression and classification problems on tabular data. XGBoost is downloaded roughly 1.5 million times daily, and Kaggle’s 2025 Survey ranks it as a top algorithm for tabular datasets. Ultimately, XGBoost delivers powerful performance with remarkable ease of use. 

However, growing data volumes in the enterprise demand a solution that can scale XGBoost model training. Scaling here means reducing training times and handling large datasets (e.g., greater than 50M rows).

XGBoost model training can be accelerated using graphics processing units (GPUs); in some instances GPUs can train 10-20x faster than central processing units (CPUs). However, GPU memory (i.e., VRAM) is much smaller than CPU memory (i.e., RAM), limiting the amount of training data that can be used.

There exist two approaches to solving the memory bottleneck of GPUs for XGBoost: distributed data parallel (DDP) and out-of-core GPU training. First, and most commonly, machine learning practitioners use DDP to shard a dataset across multiple GPUs for training. On Databricks, data sharding and synchronization of weights and buffers for DDP can be implemented through frameworks like Ray. Second, GPU memory can be extended to leverage CPU RAM or disk in an approach called out-of-core training (Ou 2020). That paper demonstrates that the out-of-core implementation allows training on a dataset much larger than the GPU VRAM without sacrificing accuracy; the implementation was released in XGBoost 3.0. DDP and out-of-core training can be combined to efficiently train on large datasets with limited GPU resources.

This blog post explores using Ray on Databricks to scale XGBoost with GPUs to large datasets, and out-of-core training to reduce memory pressure.

How is Ray on Databricks used to scale XGBoost? 

Ray has several high-level libraries for scaling machine learning workloads with DDP and hyperparameter tuning (HPO): Ray Data, Ray Train, and Ray Tune, all of which can be run on Databricks. Ray Data creates a distributed dataset that is piped into Ray Train for DDP, and Ray Tune layers on top for parallelized HPO. See here for starting a Ray cluster on Databricks Classic Compute and Serverless GPU Compute.

DDP with Ray begins with a trainable, a Python function that encapsulates the logic for training. For XGBoost, this can be a function that performs basic data preprocessing and trains the model, as sketched below.

[Screenshot: trainable.png - the XGBoost trainable function]
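As a rough illustration, a trainable might look like the following minimal sketch. The label column, objective, and default hyperparameters are placeholders, not the exact code from the screenshot.

```python
import pandas as pd
import xgboost as xgb

def train_fn(config: dict, train_df: pd.DataFrame, valid_df: pd.DataFrame):
    """Trainable: encapsulates preprocessing and XGBoost training for one run."""
    label = "label"  # placeholder label column
    # Basic preprocessing: build in-memory (in-core) DMatrix objects.
    dtrain = xgb.DMatrix(train_df.drop(columns=[label]), label=train_df[label])
    dvalid = xgb.DMatrix(valid_df.drop(columns=[label]), label=valid_df[label])
    # Train on the GPU; hyperparameters arrive through `config`.
    return xgb.train(
        {"objective": "binary:logistic", "device": "cuda", "tree_method": "hist", **config},
        dtrain,
        num_boost_round=config.get("num_boost_round", 100),
        evals=[(dvalid, "valid")],
    )
```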

In cases of large datasets, a single worker node is likely insufficient due to memory requirements and/or long training times. A common solution is to shard the dataset across multiple workers and orchestrate gradient collection and weight updates using DDP. DDP with XGBoost can be implemented using Ray Train's XGBoostTrainer wrapped around the XGBoost trainable, as in the sketch below.

[Screenshot: train_driver.png - Ray Train XGBoostTrainer wrapping the trainable]
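The following sketch shows the shape of this integration, assuming a recent Ray release where XGBoostTrainer accepts a train_loop_per_worker function; the dataset path, label column, and hyperparameters are placeholders.

```python
import ray
import ray.train
import xgboost as xgb
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

def train_loop_per_worker(config: dict):
    """Per-worker trainable: each worker trains on its own shard of the Ray Dataset."""
    # Materialize this worker's shard into a pandas DataFrame (in-core training).
    shard = ray.train.get_dataset_shard("train")
    train_df = shard.materialize().to_pandas()
    label = "label"  # placeholder label column
    dtrain = xgb.DMatrix(train_df.drop(columns=[label]), label=train_df[label])
    # The trainer sets up XGBoost's collective communication, so xgb.train on each
    # worker contributes to a single distributed model; the callback reports
    # per-iteration metrics back to Ray Train/Tune.
    xgb.train(
        {"objective": "binary:logistic", "device": "cuda", "tree_method": "hist", **config},
        dtrain,
        num_boost_round=config.get("num_boost_round", 100),
        evals=[(dtrain, "train")],
        callbacks=[RayTrainReportCallback()],
    )

# Placeholder Ray Dataset; this could also come from ray.data.read_databricks_table.
train_ds = ray.data.read_parquet("/Volumes/catalog/schema/volume/train/")

trainer = XGBoostTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"eta": 0.1, "max_depth": 8},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": train_ds},
)
result = trainer.fit()
```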

Lastly, performance for XGBoost models can be improved through hyperparameter tuning. Traditional implementations of HPO using scikit-learn only work on a single node. Ray Tune parallelizes hyperparameter tuning across a cluster by wrapping the trainable or Ray Train object and integrating with search frameworks like Optuna. Furthermore, each hyperparameter trial can be logged to MLflow on Databricks for experiment tracking with a simple run configuration.

[Screenshot: tune.png - Ray Tune wrapping the XGBoostTrainer with MLflow logging]
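A minimal sketch of that layering is shown below, reusing the trainer from the previous sketch. The search space, trial count, metric name, and MLflow experiment path are illustrative assumptions, and the RunConfig import location can vary across Ray versions.

```python
from ray import tune
from ray.train import RunConfig
from ray.tune.search.optuna import OptunaSearch
from ray.air.integrations.mlflow import MLflowLoggerCallback

# Illustrative search space; keys under train_loop_config are passed to the trainable.
param_space = {
    "train_loop_config": {
        "max_depth": tune.randint(4, 12),
        "eta": tune.loguniform(1e-3, 3e-1),
    }
}

tuner = tune.Tuner(
    trainer,  # the XGBoostTrainer defined above
    param_space=param_space,
    tune_config=tune.TuneConfig(
        num_samples=16,
        search_alg=OptunaSearch(),
        metric="train-logloss",
        mode="min",
    ),
    run_config=RunConfig(
        name="xgboost_hpo",
        # Log each trial to an MLflow experiment on Databricks (placeholder path).
        callbacks=[MLflowLoggerCallback(tracking_uri="databricks", experiment_name="/Shared/xgboost_hpo")],
    ),
)
results = tuner.fit()
```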

A full working example for performing DDP with in-core (i.e., traditional) XGBoost and HPO can be found here. In summary, Figure 1 below depicts the nesting of the Python objects for DDP and HPO across a cluster with Ray.

Figure 1: High-level view for implementing DDP with Ray for XGBoost. Begin with a trainable, a Python function that encapsulates the logic for training. If needed, add in distributed data parallel using a framework like Ray Train's XGBoostTrainer (white). Lastly, parallelize hyperparameter tuning across a cluster using Ray Tune (cyan).

Reducing memory pressures with out-of-core GPU training

How does XGBoost out-of-core differ from in-core?

GPU training with XGBoost requires the dataset (the full dataset, or a shard of it in DDP) to reside in GPU VRAM. The dataset is always moved from disk → CPU → GPU, causing memory spikes along the way: on the CPU a compressed matrix is created, and on the GPU that matrix is converted into a format optimized for GPU sparse matrix operations (the ELLPACK matrix).

The in-core (i.e., traditional XGBoost) approach requires the dataset to be in memory at each step. Because of that, memory pressure spikes from housing both the dataset and the derived compressed or ELLPACK matrix. Specifically, on the CPU, RAM spikes to 6-7x the dataset size from the full dataset plus the generated compressed matrix. On the GPU, the compressed matrix from the CPU plus the created ELLPACK matrix causes a memory spike of roughly 1.5x the dataset size. This traditional implementation creates an unnecessary demand for memory that can be alleviated by the out-of-core approach.

The out-of-core approach reduces that memory spike through two innovations: batch loading data to iteratively create the matrices and leveraging the CPU to stage chunks of the ELLPACK matrix generated from the GPU. This ultimately results in reducing CPU RAM from 6-7x → 2x and GPU VRAM from 1.5x → 0.5x.

 

Total estimated memory requirements:

  • In-core: CPU RAM ~7x the dataset size; GPU VRAM ~1.5x the dataset size
  • Out-of-core: CPU RAM ~2x the dataset size; GPU VRAM ~0.5x the dataset size
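For example, under these estimates a 10 GB training set would need roughly 70 GB of CPU RAM and 15 GB of GPU VRAM with in-core training, but only about 20 GB of CPU RAM and 5 GB of GPU VRAM with out-of-core training.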

Implementing XGBoost out-of-core training

Out-of-core GPU training means implementing a data iterator that passes batches of data to XGBoost data structures (i.e., DMatrix or ExtMemQuantileDMatrix). After that, training remains similar to the in-core implementation: you pass the data structure to a train function.

The data iterator requires implementing two class methods: next and reset. next advances the iterator by one step and passes the current batch to the XGBoost data structure. reset sets the internal counter back to zero for the multiple passes over the dataset needed for data structure construction and training. A base example of the iterator can be viewed in the XGBoost documentation here.

[Screenshot: ooc_data_iterator.png - XGBoost data iterator with next and reset]
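A minimal sketch of such an iterator is shown below, assuming the training data has been pre-split into Parquet files; the file list, label column, and cache prefix are placeholders rather than the exact code in the screenshot.

```python
import os
import pandas as pd
import xgboost as xgb

class ParquetBatchIterator(xgb.DataIter):
    """Feeds XGBoost one batch (file) at a time instead of the full dataset."""

    def __init__(self, file_paths: list, label: str):
        self._file_paths = file_paths
        self._label = label
        self._it = 0  # internal counter over batches
        # cache_prefix tells XGBoost where to place its external-memory cache.
        super().__init__(cache_prefix=os.path.join(".", "xgb_cache"))

    def next(self, input_data) -> bool:
        """Advance one batch; return False when the dataset is exhausted."""
        if self._it == len(self._file_paths):
            return False
        df = pd.read_parquet(self._file_paths[self._it])
        # Hand this batch to XGBoost through the supplied callback.
        input_data(data=df.drop(columns=[self._label]), label=df[self._label])
        self._it += 1
        return True

    def reset(self) -> None:
        """Rewind to the first batch; XGBoost makes multiple passes over the data."""
        self._it = 0

# Build the external-memory DMatrix from the iterator, then train as usual.
it = ParquetBatchIterator(["part-0.parquet", "part-1.parquet"], label="label")
dtrain = xgb.ExtMemQuantileDMatrix(it)
booster = xgb.train({"device": "cuda", "tree_method": "hist"}, dtrain, num_boost_round=100)
```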

The base example can be modified for GPU computing and integrated with Ray for DDP. This means using NVIDIA's RAPIDS Memory Manager (RMM) to stream data from RAM to the GPU and consuming the distributed Ray Data shard inside the XGBoost data iterator. This data iterator can then be integrated with the trainable, the Python function that encapsulates our training logic, for the out-of-core implementation. See the highlighted sections for where the iterator connects the Ray Data shard to the XGBoost data structure.

[Screenshot: ooc_trainable.png - out-of-core trainable combining the Ray Data shard, the data iterator, and XGBoost]
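The sketch below illustrates one way to wire these pieces together, assuming RMM and a recent XGBoost release with ExtMemQuantileDMatrix; the label column, batch size, and default hyperparameters are placeholders.

```python
import ray
import ray.train
import rmm
import xgboost as xgb

class RayShardIterator(xgb.DataIter):
    """Streams batches of a Ray Data shard into XGBoost instead of materializing it."""

    def __init__(self, shard, label: str, batch_size: int):
        self._shard = shard
        self._label = label
        self._batch_size = batch_size
        self._batches = None
        super().__init__()

    def next(self, input_data) -> bool:
        if self._batches is None:
            # Lazily iterate over this worker's shard in pandas batches.
            self._batches = iter(self._shard.iter_batches(
                batch_size=self._batch_size, batch_format="pandas", prefetch_batches=2
            ))
        batch = next(self._batches, None)
        if batch is None:
            return False
        input_data(data=batch.drop(columns=[self._label]), label=batch[self._label])
        return True

    def reset(self) -> None:
        # Restart iteration over the shard for the next pass.
        self._batches = None

def train_loop_per_worker(config: dict):
    # Route XGBoost's GPU allocations through a pooled RMM allocator to reduce
    # the cost of repeated host-to-device transfers.
    rmm.reinitialize(pool_allocator=True)
    xgb.set_config(use_rmm=True)

    shard = ray.train.get_dataset_shard("train")
    it = RayShardIterator(shard, label="label", batch_size=config.get("batch_size", 512))
    # ExtMemQuantileDMatrix builds its quantized pages batch by batch.
    dtrain = xgb.ExtMemQuantileDMatrix(it)
    xgb.train(
        {"objective": "binary:logistic", "device": "cuda", "tree_method": "hist",
         "sampling_method": "gradient_based", **config.get("xgb_params", {})},
        dtrain,
        num_boost_round=config.get("num_boost_round", 100),
    )
```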

Lastly, as with the in-core implementation, this trainable can be integrated with Ray Train's XGBoostTrainer for DDP and with Ray Tune for parallelized HPO.

A full working example for performing DDP with out-of-core GPU training and HPO can be found here for Databricks Classic Compute and Serverless GPU Compute.

Summary

This blog implements an end-to-end solution for scaling XGBoost with GPUs to large datasets through DDP and out-of-core training. DDP with Ray provides a framework for distributed training across a cluster of GPUs, and the out-of-core implementation reduces the required CPU RAM (from 7x to 2x the dataset size) and GPU VRAM (from 1.5x to 0.5x the dataset size). Given the exponential growth of enterprise data, this solution enables scaling one of the most powerful tree-based models to datasets significantly larger than a traditional single-node implementation can handle. Furthermore, optimizing GPU utilization by minimizing unnecessary memory pressure is crucial, especially given the limited availability of GPUs.

Acknowledgement

We'd like to thank TJ Cycyota, Chengyin Eng, Sean Owen, Alex Miller, and Peyman Mohajerian for their invaluable feedback and suggestions.

Supplemental

Lessons from the Field

If a Ray Dataset is created using `ray.data.read_databricks_table`, you may run into issues with:

  • Read concurrency if too many worker nodes try to read the table at once. This can be resolved by modifying the `databricks.sqlgateway.taskSlotBasedRejection.max.<warehouse-size>.concurrency` feature flag, where an example <warehouse-size> is “small”.
  • Losing the connection to the warehouse if training jobs take longer than 1 hour. This can be resolved by calling `.materialize()` on your Ray Dataset, as in the sketch below.
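As a hedged sketch of that pattern (the warehouse ID and table name are placeholders, and the workspace host/token are assumed to be set in the environment):

```python
import ray

# Read a Unity Catalog table through a SQL warehouse into a Ray Dataset.
ds = ray.data.read_databricks_table(
    warehouse_id="<sql-warehouse-id>",  # placeholder warehouse ID
    catalog="main",
    schema="default",
    table="training_data",
)
# Materialize the dataset into the Ray object store so long-running training
# jobs do not depend on the warehouse connection staying open.
ds = ds.materialize()
```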

One of the reported downsides of out-of-core GPU training is increased training time. A benchmark using the Higgs dataset showed that training times for out-of-core training were approximately 70% longer than in-core using NVIDIA T4 GPUs (Ou 2020). This added latency is due to the movement of data from the CPU to the GPU and vice-versa (i.e., bus transfer) during the matrix construction and minimal variance sampling.

However, our results on modern hardware with a 100 million row by 100 column dataframe showed a drastic difference in performance. Out-of-core training (M = 172.77s, SD = 16.49s) completed trials 2.4x faster and required only 2 g5.4xlarge worker nodes (A10G with 64GB CPU RAM) for DDP, compared to in-core training (M = 412.77s, SD = 15.44s), which required 5 worker nodes for DDP. Furthermore, because fewer workers were required for DDP with the out-of-core implementation, more HPO trials could be run in parallel. In a 10 worker node cluster, 16 HPO trials completed in ~56 minutes for in-core (2 trials in parallel) and ~11 minutes for out-of-core (5 trials in parallel).

So why did our results differ from the original paper? The difference is likely a result of the in-core vs. out-of-core implementations and of system bottlenecks. First, the in-core implementation for DDP with Ray necessitates materializing the dataset into a Pandas DataFrame before it can be converted to an XGBoost data structure. Contrast this with the out-of-core implementation, which pipes batches of the dataset directly into an XGBoost data structure, skipping the time-consuming materialization of the dataset, and leverages NVIDIA's RAPIDS Memory Manager to stream data from the CPU to the GPU. Second, system bottlenecks can occur when adding more worker nodes in DDP: as you add more workers, you decrease the gradient computation per node but increase the network cost of shuttling data and gradients.

Our results highlight that DDP with the out-of-core approach improves scaling, specifically through faster training times, memory optimizations, and increased HPO throughput compared to in-core. The single drawback, though minor, is managing the code for the batch iterator.

Parameters for Out-of-core GPU training

Setting up XGBoost for GPU training and external memory requires tuning three major hyperparameters:

  • device (recommended value: cuda). Specifies the device to use for training; setting this to cuda enables GPU acceleration.
  • tree_method (recommended value: hist). The tree construction algorithm. hist works by binning continuous features into discrete buckets, dramatically reducing the number of split candidates that need to be evaluated. exact, on the other hand, evaluates every single point for increased precision but is costly in both memory and computational time; furthermore, exact is not supported on GPUs.
  • sampling_method (recommended value: gradient_based). Method for sampling data instances. gradient_based is highly efficient for GPU memory usage, especially with the hist tree method.
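Put together, those recommendations translate into a parameter dictionary like the following sketch; the objective, learning rate, subsample ratio, and synthetic data are illustrative only.

```python
import numpy as np
import xgboost as xgb

# Synthetic data stand-in for a real training set.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "device": "cuda",                     # GPU acceleration
    "tree_method": "hist",                # histogram-based tree construction
    "sampling_method": "gradient_based",  # memory-efficient sampling on GPU
    "subsample": 0.2,                     # gradient-based sampling allows low subsample ratios
    "objective": "binary:logistic",
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```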

When using XGBoost's `ExtMemQuantileDMatrix` data format, adjust the parameters below if needed to balance GPU utilization and CPU memory requirements:

  • max_num_device_pages (recommended value: start with the default and increase if you have spare GPU memory). Controls the size of a GPU cache for data pages; a larger cache reduces the need to transfer data from the CPU, speeding up training at the cost of higher GPU memory usage.
  • max_quantile_batches (recommended value: start with the default and decrease if CPU memory usage is too high). Sets the number of data batches processed at a time on the CPU for quantile calculation; smaller values consume less CPU RAM but may result in less accurate quantile sketches.
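As a rough sketch only (argument names follow the table above and the XGBoost external-memory docs, and may differ across XGBoost versions), these parameters are supplied when constructing the `ExtMemQuantileDMatrix`:

```python
import xgboost as xgb

# `it` is a batch iterator such as the one sketched earlier; values are illustrative.
dtrain = xgb.ExtMemQuantileDMatrix(
    it,
    max_bin=256,
    max_num_device_pages=1,   # extra data pages kept resident in GPU VRAM
    max_quantile_batches=32,  # cap on batches held on the CPU during quantile sketching
)
```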

When performing DDP using Ray Train, each worker receives a specific shard of the dataset. These shards can be batched and iterated over for XGBoost data matrix construction, controlled by the following parameters:

  • batch_size (recommended value: start with the largest size that fits in GPU VRAM, e.g., 256 or 512). Controls the number of rows in each batch, managing the trade-off between memory usage and computational speed.
  • prefetch_batches (recommended value: start with 1 or 2, and increase if the GPU is underutilized and you have CPU RAM to spare). The number of batches to fetch ahead of time; hides data-loading latency by overlapping CPU and GPU work.
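In the out-of-core trainable sketched earlier, these map onto the arguments of `iter_batches` on the worker's dataset shard, roughly as follows (values are illustrative):

```python
import ray.train

# Inside the per-worker trainable: iterate over this worker's shard in batches.
shard = ray.train.get_dataset_shard("train")
batches = shard.iter_batches(
    batch_size=512,       # rows handed to the XGBoost data iterator per batch
    prefetch_batches=2,   # overlap loading of upcoming batches with GPU work
    batch_format="pandas",
)
```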