Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Experiences with CatBoost Spark Integration in Production on Databricks?

moh3th1
New Contributor II

Hi Community,

I am currently evaluating various gradient boosting options on Databricks using production-level data, including the CatBoost Spark integration (ai.catboost:catboost-spark).

I would love to hear from others who have successfully used this specific integration for production workloads. How have you found its stability and resource requirements, particularly concerning the driver, compared to alternatives like XGBoost Spark or LightGBM (via SynapseML)?

Are there any other preferred libraries or approaches for robust gradient-boosting training within the Databricks environment?

Thank you for sharing your insights!

1 ACCEPTED SOLUTION


stbjelcevic
Databricks Employee

Hi @moh3th1 ,

I can't personally speak to using CatBoost, but I can share preferred libraries and per-approach recommendations for gradient-boosting training on Databricks.

  • Preferred for robust distributed GBM on Databricks: XGBoost Spark

    • Use Databricks ML runtimes and the xgboost.spark estimators in MLlib pipelines; set num_workers=sc.defaultParallelism, disable autoscaling, log models with mlflow.spark.log_model, and enable GPU with use_gpu=True if needed (one GPU per task). See the xgboost.spark sketch after this list.
    • Follow published best practices for cluster sizing, repartition inputs to match the number of workers, and avoid driver-side data localization; production case studies can serve as a blueprint.
  • If you need CatBoost:

    • Prefer a Maven installation with the exact ai.catboost:catboost-spark coordinate (matching your Spark/Scala versions) so dependencies resolve correctly; if Maven access is blocked, set up a private proxy rather than uploading JARs piecemeal. See the CatBoost sketch after this list.
    • If you encounter Java 17 module-access errors tied to CatBoost Spark driver classes, validate the DBR version and compute type; we've seen alerts only on classic compute, so upgrading DBR and standardizing the runtime/JDK may help. Otherwise, capture driver JFR recordings and heap histograms for root-cause analysis, and consider a temporary alternative for critical training windows.
  • If you need LightGBM specifically:

    • SynapseML can work, but be mindful of native library and architecture constraints (e.g., ARM vs. AMD/x86) and use Maven-based installs; test on AMD/x86 instances if you hit lib_lightgbm.so load failures, and avoid serverless for now given its RDD dependencies. See the SynapseML sketch after this list.
    • If you're flexible, Spark ML's GBTRegressor is simpler to operate and often "good enough," though the SynapseML docs suggest it may be 10–30% slower in some scenarios.
  • Alternative: Ray + XGBoost-Ray (for long-term portability and serverless roadmap):

    • Internal field guidance increasingly recommends XGBoost-Ray in place of xgboost.spark when teams want to align with the Ray ecosystem and future serverless support; Ray is fully supported on Databricks, integrates with MLflow, and complements Spark for data parallelism. See the XGBoost-Ray sketch after this list.
  • AutoML

    • Databricks AutoML orchestrates baseline training across XGBoost and LightGBM with reproducible notebooks and MLflow logging. This is helpful when you want consistent pipelines and code artifacts without building everything from scratch. See the AutoML sketch after this list.
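
A minimal sketch of the xgboost.spark approach above, assuming a Databricks ML Runtime cluster; the table name, feature columns, and hyperparameters are placeholders, not tuned recommendations:

```python
# Distributed XGBoost with xgboost.spark on a Databricks ML Runtime cluster.
# Table name, feature columns, and hyperparameters below are placeholders.
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor

df = spark.table("main.default.training_data")  # placeholder table

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
xgb = SparkXGBRegressor(
    features_col="features",
    label_col="label",
    num_workers=sc.defaultParallelism,  # match total task slots; keep autoscaling off
    # use_gpu=True,                     # uncomment on a GPU cluster (one GPU per task)
    max_depth=6,
    n_estimators=200,
)

with mlflow.start_run():
    model = Pipeline(stages=[assembler, xgb]).fit(df)
    mlflow.spark.log_model(model, "model")
```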
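
If you do evaluate CatBoost, here is a minimal sketch assuming the catboost-spark Maven library (with a coordinate matching your Spark/Scala versions) is attached to the cluster; the catboost_spark Python module ships with that package, and the table below is a placeholder:

```python
# Minimal CatBoost Spark sketch; assumes ai.catboost:catboost-spark_<spark>_<scala>:<version>
# (matching your cluster's Spark/Scala versions) is installed as a Maven cluster library.
import catboost_spark
from pyspark.ml.feature import VectorAssembler

df = spark.table("main.default.training_data")  # placeholder table
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(df).select("features", "label")

train_pool = catboost_spark.Pool(train_df)  # expects "features" and "label" columns
model = catboost_spark.CatBoostRegressor().fit(train_pool)
```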
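
For LightGBM via SynapseML, a minimal sketch assuming the SynapseML Maven package (e.g., com.microsoft.azure:synapseml_2.12) is installed on an x86 cluster; Spark ML's built-in GBTRegressor is shown alongside as the simpler option:

```python
# LightGBM via SynapseML (Maven-installed, x86 cluster) next to Spark ML's built-in GBT.
# Table and column names are placeholders.
from synapse.ml.lightgbm import LightGBMRegressor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

df = spark.table("main.default.training_data")  # placeholder table
train_df = (
    VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    .transform(df)
    .select("features", "label")
)

lgbm_model = LightGBMRegressor(labelCol="label", featuresCol="features",
                               numIterations=200).fit(train_df)

# Built-in alternative: simpler to operate, no extra libraries, often good enough.
gbt_model = GBTRegressor(labelCol="label", featuresCol="features", maxIter=100).fit(train_df)
```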
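
For the Ray route, a minimal XGBoost-Ray sketch on Ray-on-Spark; the cluster sizing, data paths, and hyperparameters are placeholders, and the setup_ray_cluster arguments have changed across Ray releases, so check the version bundled with your DBR:

```python
# XGBoost-Ray on a Ray-on-Spark cluster. Sizing values, paths, and params are placeholders;
# verify setup_ray_cluster's arguments against the Ray version on your runtime.
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
from xgboost_ray import RayDMatrix, RayParams, train

setup_ray_cluster(num_worker_nodes=2, num_cpus_per_node=4)
ray.init(ignore_reinit_error=True)

# RayDMatrix reads Parquet shards directly; these paths are placeholders.
dtrain = RayDMatrix(
    ["/dbfs/tmp/train/part-0.parquet", "/dbfs/tmp/train/part-1.parquet"],
    label="label",
)

booster = train(
    {"objective": "reg:squarederror", "max_depth": 6},
    dtrain,
    num_boost_round=200,
    ray_params=RayParams(num_actors=2, cpus_per_actor=4),
)

shutdown_ray_cluster()
```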
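
And for an AutoML baseline, a minimal sketch; the table and target column are placeholders:

```python
# Databricks AutoML baseline across XGBoost/LightGBM; generates notebooks and MLflow runs.
from databricks import automl

df = spark.table("main.default.training_data")  # placeholder table
summary = automl.regress(df, target_col="label", timeout_minutes=30)
print(summary.best_trial.mlflow_run_id)  # best baseline run to compare against
```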

Practical tips to minimize driver issues during GBM training:

  • Disable autoscaling and set num_workers to match total task slots; repartition inputs to avoid hidden repartitions by the estimators.

  • Avoid large broadcasts and check AQE thresholds; with a large-memory driver (32 GB+), raising the broadcast threshold to roughly 200 MB can be safe, but remember Spark's hard 8 GB limit on a broadcast's in-memory size after decompression (see the config sketch after these tips).

  • Use Ganglia/Spark UI to verify causes such as too few shuffle partitions, data skew, window functions without PARTITION BY, and large UDF memory footprints; often these, not the algorithm, are the real root causes.

  • When troubleshooting recurrent driver unresponsiveness, follow the runbook, collect JFR recordings, and consider DBR upgrades (several driver stability improvements landed across 15.x–16.x).
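
An illustrative config sketch for the broadcast and partitioning tips above; the values are examples to tune against your driver size, not defaults to copy:

```python
# Illustrative settings only; tune against your driver memory and workload.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))           # ~200 MB
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))  # AQE path

# Repartition inputs to match total task slots before training to avoid hidden repartitions.
train_df = spark.table("main.default.training_data").repartition(sc.defaultParallelism)
```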

