I'm looking to scale xgboost to large datasets that won't fit in memory on a single large EC2 instance (on the order of billions to tens of billions of rows). I also require many of the bells & whistles of regular in-memory xgboost/lightgbm, including:
- The ability to do early-stopping
- The ability to either consume sparse feature representations or handle categoricals natively (I have a lot of high-cardinality categoricals, so one-hot-encoding them to dense at the billion-plus-row scale is not a feasible approach)
- The ability to impose monotonicity constraints
- (Ideally) the ability to train for some number of iterations, then increase the learning rate and continue training. I'm finding that with very large datasets at low learning rates, you get reliable round-by-round improvements in validation loss, but the improvements quickly shrink to the 5th decimal place, so it takes a long time to reach convergence and the resulting model has so many trees that inference is very slow. (The single-machine mechanism I mean is sketched below.)
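For reference, the single-machine version of what I mean: vanilla xgboost lets you continue boosting from an existing model while changing parameters between calls. A minimal sketch with toy data (all parameter values are arbitrary placeholders):

```python
import numpy as np
import xgboost as xgb

# Toy data purely for illustration; substitute your real DMatrix objects.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 20)), rng.integers(0, 2, size=10_000)
dtrain = xgb.DMatrix(X[:8_000], label=y[:8_000])
dvalid = xgb.DMatrix(X[8_000:], label=y[8_000:])

params = {"objective": "binary:logistic", "tree_method": "hist", "learning_rate": 0.05}

# Phase 1: boost at a low learning rate with early stopping.
booster = xgb.train(
    params, dtrain, num_boost_round=500,
    evals=[(dvalid, "valid")], early_stopping_rounds=50,
)

# Phase 2: raise the learning rate and continue from the existing ensemble
# by passing the previous booster via xgb_model.
params["learning_rate"] = 0.2
booster = xgb.train(
    params, dtrain, num_boost_round=500,
    evals=[(dvalid, "valid")], early_stopping_rounds=50,
    xgb_model=booster,
)
```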
I've tried a number of approaches, most of which have not worked and none of which have been fully satisfactory. I'll provide details below, but my high-level question is: "is there a best-practice way of doing this (on Databricks)?"
I'm certainly not an expert on GPU training, but my intuition is that xgboost isn't really a mini-batch algorithm like SGD for neural networks, so we're always going to be memory-constrained and GPU-based approaches are not the way forward here. Happy to be told otherwise.
What have I tried?
Spark.ml's GBTClassifier:
Pros:
- Can consume sparse encodings, and thus can run as part of a broader spark pipeline
Cons:
- Doesn't seem to have any of the other functionality I need; I can't even figure out how to get it to early-stop, short of training it manually round-by-round (sketched below)
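To spell out what I mean by "manually round-by-round": as far as I can tell the only workaround is something like the sketch below, where `train_df`/`valid_df` are placeholder DataFrames with `features` and `label` columns. Because there's no warm start, each step refits the whole ensemble from scratch at a larger `maxIter`, which is why this isn't practical at this scale.

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Crude "early stopping": refit at increasing maxIter and stop once the
# validation metric stops improving. Every step retrains from scratch.
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
best_metric, best_model = float("-inf"), None

for n_trees in range(10, 110, 10):
    model = GBTClassifier(maxIter=n_trees, maxDepth=6).fit(train_df)
    metric = evaluator.evaluate(model.transform(valid_df))
    if metric <= best_metric:
        break  # validation AUC stopped improving; keep the previous model
    best_metric, best_model = metric, model
```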
XGBoost's SparkXGBClassifier:
Pros:
- Part of the XGBoost library, and thus ostensibly has the same API and all the same bells and whistles
Cons:
- Cannot consume Spark's sparse feature encodings, and doesn't yet have the `enable_categorical` functionality that regular xgboost has
SynapseML's LightGBMClassifier:
Pros:
- Can consume sparse encodings, and thus can run as part of a broader spark pipeline
- Apparently has all of the bells and whistles I require
Cons:
- I haven't managed to get it to work. I can train a model on a relatively large dataset, but as soon as I add a validation column for early stopping (roughly as in the sketch after this list), I start getting out-of-memory errors related to `spark.driver.maxResultSize`. See https://github.com/microsoft/SynapseML/issues/2294 for more details
- Have not been able to get verbose training output to work either, so it's hard to tell how long training is going to take or how well it's doing
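For context, "adding a validation column" means something like the sketch below (my understanding of the SynapseML parameter names; `df` is a placeholder DataFrame with an assembled `features` vector and a `label` column). Training runs fine until `validationIndicatorCol` is set, at which point the `spark.driver.maxResultSize` errors start.

```python
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMClassifier

# Flag ~10% of rows as validation data for early stopping.
df = df.withColumn("is_valid", F.rand(seed=42) < 0.1)

clf = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    validationIndicatorCol="is_valid",  # the part that triggers the driver OOM
    earlyStoppingRound=50,
    numIterations=2000,
    learningRate=0.05,
)
model = clf.fit(df)
```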
Ray XGBoost:
Pros:
- Markets itself as the canonical way to do this
- Docs make it look really simple
- Appears to have all of the functionality that vanilla xgboost has
Cons:
- Have only been able to find documentation for really simple examples. I can get those to work, but as soon as I up the complexity (DMatrices, categorical features, early stopping), I run into all sorts of incomprehensible error messages, e.g. `TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current`. This seems to be related to running out of resources, but for the example I was running there's just no way I was memory-constrained. (The kind of configuration I mean is sketched below.)
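For reference, the shape of what I'm running looks roughly like the sketch below, via the `xgboost_ray` entry point (the path, label column, and actor counts are placeholders). As I understand it, the placement-group timeout means Ray couldn't reserve the requested actors' resources, which is what confuses me given the cluster size.

```python
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder dataset: a parquet file on S3 with a "label" column.
dtrain = RayDMatrix("s3://my-bucket/train.parquet", label="label")

booster = train(
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=500,
    # The actor reservation below (num_actors x cpus_per_actor) has to fit
    # inside the Ray cluster, otherwise placement group creation times out.
    ray_params=RayParams(num_actors=16, cpus_per_actor=8),
)
```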
Dask:
Dask XGBoost (running on dask-databricks):
Pros:
- Has all the functionality I need
- I've actually been able to get it to work, and have trained xgboost models on a dataset with ~2bn rows and 50 features, about half of which are categorical (hash-modded to curtail the long tail of cardinality), handling the categoricals natively (roughly the setup sketched after this list)
Cons:
- It's not ideal to be running dask on Databricks: all the ETL pipelines I have that prepare the data are in spark, so generally I'm running a pipeline, writing data to S3, then firing up a new cluster and starting a dask cluster on it
- This method still struggles when the trees get quite deep. With the aforementioned 2-billion-row dataset, I can get it to run at max_depth up to about 12; beyond that I often run into 137 OOM errors, 3 hours into training. The problem is that the whole point of scaling up to these data sizes is that the optimal max_depth increases, and that's how you actually eke out more ML improvement. I can see on my test set that more depth would likely give me better test loss
- Also no verbose training output, which is pretty annoying when you're training a model for multiple hours (note: I was able to get verbose training output when I ran it on vanilla dask rather than dask-databricks, but I found vanilla dask to be extremely unreliable on databricks, with workers constantly terminating and restarting, etc.)
- 2 billion rows is pretty slow, even on a pretty big cluster (~120 DBUs/hr). Dask starts to complain about the size of the task graph (especially at deeper max_depths), and this doesn't "feel" like it'll scale to the ten-billion-plus-row scale. It struggles noticeably when I curtail my categorical feature cardinality at, say, 5000 rather than, say, 100
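The working setup is roughly the sketch below (paths, column names, and parameter values are placeholders for the real pipeline):

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client()  # on dask-databricks, connect to the running cluster's scheduler instead

# Cast the (hash-modded) categorical columns to dask's categorical dtype so
# xgboost can consume them natively via enable_categorical.
df = dd.read_parquet("s3://my-bucket/train/")
df = df.categorize(columns=["cat_a", "cat_b"])
train_df, valid_df = df.random_split([0.9, 0.1], random_state=42)

dtrain = xgb.dask.DaskDMatrix(
    client, train_df.drop(columns=["label"]), train_df["label"],
    enable_categorical=True,
)
dvalid = xgb.dask.DaskDMatrix(
    client, valid_df.drop(columns=["label"]), valid_df["label"],
    enable_categorical=True,
)

result = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist",
     "max_depth": 12, "learning_rate": 0.05},
    dtrain,
    num_boost_round=1000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=50,
)
booster = result["booster"]  # best_iteration is set when early stopping fires
```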
Dask LightGBM:
Pros:
- Ostensibly has all of the functionality I need
- Seeing as I can get dask-xgboost to work, and lightgbm trains faster than xgboost, it's reasonable to hope that these train-time differences will become very material once we scale to really big datasets
Cons:
- Can't get it to work; I get rather incomprehensible errors of the form `FutureCancelledError: operator.itemgetter(1)-aedec8478dd062943dfc5db591c68b4c cancelled for reason: unknown`. See https://github.com/microsoft/LightGBM/issues/6660 for more details. (The basic shape of the call is sketched below.)
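A minimal sketch of the kind of call I mean, assuming the lightgbm dask estimator API (paths, columns, and hyperparameters are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client
from lightgbm import DaskLGBMClassifier

client = Client()

df = dd.read_parquet("s3://my-bucket/train/")
X = df.drop(columns=["label"])
y = df["label"]

# Dask-native LightGBM estimator; hyperparameters are arbitrary placeholders.
clf = DaskLGBMClassifier(
    client=client,
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=12,
)
model = clf.fit(X, y)
```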
So, in conclusion, the only approach I've managed to scale to a dataset meaningfully larger than anything I can fit in memory is xgboost.dask running on dask-databricks (I can scale to ~100 million rows in memory on a large EC2 instance reasonably comfortably with both xgboost and lgbm), and it's far from an optimal solution, so I'm still on the hunt for the best way to do this.