04-11-2025 07:57 AM
Hi Community,
I am currently evaluating various gradient boosting options on Databricks using production-level data, including the CatBoost Spark integration (ai.catboost:catboost-spark).
I would love to hear from others who have successfully used this specific integration for production workloads. How have you found its stability and resource requirements, particularly concerning the driver, compared to alternatives like XGBoost Spark or LightGBM (via SynapseML)?
Are there any other preferred libraries or approaches for robust gradient-boosting training within the Databricks environment?
Thank you for sharing your insights!
3 weeks ago
Hi @moh3th1 ,
I can't personally speak to using CatBoost, but I can share the preferred libraries and recommendations for the main gradient-boosting approaches within Databricks.
Preferred for robust distributed GBM on Databricks: XGBoost Spark (a minimal training sketch follows this list).
If you need CatBoost: the Spark integration you mention (ai.catboost:catboost-spark) is the main option.
If you need LightGBM specifically: SynapseML's LightGBM estimators.
Alternative: Ray + XGBoost-Ray (for long-term portability and the serverless roadmap).
AutoML: Databricks AutoML is also worth a look as a managed baseline.
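To make the preferred option concrete, here is a minimal sketch of distributed training with the xgboost.spark estimator that ships in recent Databricks ML runtimes. The input columns, label name, and hyperparameter values are illustrative assumptions, not recommendations:

```python
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# Assemble raw columns into the single vector column the estimator
# expects; the input column names here are hypothetical.
assembler = VectorAssembler(
    inputCols=["f1", "f2", "f3"],
    outputCol="features",
)
train_df = assembler.transform(raw_df)  # raw_df: your training DataFrame

# One distributed XGBoost task runs per worker; num_workers should
# match the cluster's total task slots (see the sizing note below).
xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=8,       # illustrative; size to your cluster
    max_depth=8,
    n_estimators=500,
)
model = xgb.fit(train_df)
```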
Disable autoscaling and set num_workers to match total task slots; repartition inputs to avoid hidden repartitions by the estimators.
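As a sketch of that sizing rule, assuming a fixed-size (non-autoscaling) cluster where sc.defaultParallelism approximates the total task slots:

```python
from xgboost.spark import SparkXGBClassifier

# On a fixed-size cluster, defaultParallelism roughly equals the total
# task slots; pre-partitioning the input to match avoids the estimator
# triggering its own hidden repartition.
total_slots = spark.sparkContext.defaultParallelism

train_df = train_df.repartition(total_slots)

xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=total_slots,
)
```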
Avoid large broadcasts and check the AQE thresholds; with a large-memory driver (32GB+), raising the broadcast threshold to ~200MB can be safe, but remember Spark's hard 8GB limit on a single broadcast table once decompressed in memory.
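For reference, these are the relevant knobs; the ~200MB value is illustrative, and note that AQE consults its own threshold, which is easy to miss:

```python
# Raise the auto-broadcast threshold only if the driver has headroom.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))
# AQE re-plans joins at runtime and consults this separate threshold.
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))
# While debugging driver memory pressure, broadcasts can be disabled entirely:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```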
Use Ganglia/the Spark UI to confirm the real culprit: too few shuffle partitions, data skew, window functions without PARTITION BY, and large UDF memory footprints are often the root cause, not the algorithm itself.
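To illustrate one item from that list (the column names are hypothetical): a window without PARTITION BY collapses the whole dataset into a single task, while a partitioned window spreads the work:

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

# Problematic: a global window forces every row into one partition/task.
w_global = Window.orderBy("event_ts")

# Better: partitioning the window distributes the computation.
w_partitioned = Window.partitionBy("customer_id").orderBy("event_ts")

df = df.withColumn("rn", F.row_number().over(w_partitioned))

# Shuffle partition count is another quick check from the same list.
spark.conf.set("spark.sql.shuffle.partitions", "400")  # illustrative value
```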
When troubleshooting recurrent driver unresponsiveness, follow the runbook, collect a Java Flight Recorder (JFR) capture from the driver, and consider a DBR upgrade (several driver stability improvements landed across 15.x–16.x).
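If it helps, JFR on the driver is enabled through JVM flags in the cluster's Spark config, which must be set before the JVM starts (i.e., at cluster creation). The duration and file path below are illustrative, and a JFR-capable JDK is assumed:

```python
# Set in the cluster's Spark config (cannot be changed at runtime):
#   spark.driver.extraJavaOptions -XX:StartFlightRecording=duration=300s,filename=/tmp/driver.jfr
# From a notebook, you can at least confirm what the driver JVM was started with:
print(spark.conf.get("spark.driver.extraJavaOptions", "not set"))
```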