Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Why is Spark MLlib not encouraged on the platform? / Why is ML dependent on .toPandas() on Databricks?

stochastic
New Contributor

I'm new to Spark and Databricks, and I'm surprised at how the Databricks ML tutorials go from Spark DataFrames to pandas DataFrames. In the tutorials I've seen, most data processing is done in a distributed manner, but the result is then just cast to a pandas DataFrame. I was excited to use Databricks for faster processing and training, but it feels like the time I gain in preprocessing is traded away when `.toPandas()` eventually shows up.

Tutorials:

- (https://docs.databricks.com/en/mlflow/end-to-end-example.html)

- Feature tables are computed in a distributed manner via the FE client, but to actually use them for training they are cast with `.toPandas()` (https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-taxi-...)

 

Overall questions:

1. Am I fundamentally misunderstanding something about ML and Databricks?

2. In the case where we do use scikit-learn and single-node compute and don't really need all the distributed compute, what is the best driver/worker setup?


mark_ott
Databricks Employee

You are noticing a common pattern in Databricks ML tutorials: data is often processed with Spark for scalability, but training and modeling are frequently done on pandas DataFrames using single-node libraries like scikit-learn. This workflow can be confusing for users expecting end-to-end distributed compute. Here’s why this happens and how to think about your setup:

Why Databricks Tutorials Use .toPandas()

  • Library Support: Most Python ML libraries (scikit-learn, XGBoost, LightGBM) expect a pandas DataFrame and run on a single machine. Spark MLlib is distributed, but does not match the breadth and maturity of the pandas-based ecosystem, particularly for custom modeling or advanced workflows.

  • Workflow Scalability: Spark excels at distributed data preparation/feature engineering when datasets exceed RAM or local disk. Once data is prepared, it's often small enough to fit in memory—so tutorials favor conversion to pandas for flexibility and speed in model training.

  • Interoperability: Converting to pandas lets you leverage rich visualization, analysis, and ML libraries that aren't Spark-native. For most projects, distributed compute bottlenecks are in ETL, not in fitting models unless using deep learning or truly huge datasets.
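
The handoff described above can be sketched as follows. This is a minimal illustration, not code from the tutorials: the Spark step is shown only in a comment (on Databricks, `spark` is predefined in notebooks), and a small pandas DataFrame with hypothetical columns stands in for the prepared features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# On Databricks, distributed feature engineering would end with something like:
#   pdf = spark.table("feature_table").select(...).toPandas()
# Stand-in pandas DataFrame for illustration (hypothetical feature columns):
pdf = pd.DataFrame({
    "feature_a": [0.1, 0.9, 0.2, 0.8],
    "feature_b": [1.0, 0.0, 1.0, 0.0],
    "label":     [0, 1, 0, 1],
})

# Single-node training: scikit-learn expects in-memory arrays/DataFrames,
# which is why the tutorials convert to pandas at exactly this boundary.
X, y = pdf[["feature_a", "feature_b"]], pdf["label"]
model = LogisticRegression().fit(X, y)
preds = model.predict(X)
```

The conversion is the seam between the two worlds: everything before it can be distributed, everything after it runs on the driver.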

Do You Need Spark for ML Training?

  • If your full training dataset fits comfortably in RAM and your model can be trained in minutes on one machine, distributed training may be overkill.

  • For larger problems—like deep learning or massive tabular datasets—Databricks supports distributed ML frameworks (like MLlib, HorovodRunner, or distributed XGBoost) that avoid the .toPandas() step, but these often require different APIs and more configuration.

Designing the Driver/Worker Setup for Single-node ML

If you use scikit-learn or similar libraries and do most work locally, the best Databricks cluster setup is:

  • Small driver node: Choose a driver with enough RAM and CPU for the final pandas DataFrame and model training. A Standard_DS3_v2 or Standard_D4_v2 instance is often enough if your data and model fit in 16-32 GB of RAM.

  • Minimal Workers: Set worker count to zero or one, unless you need distributed Spark for earlier data prep. If only single-node ML is needed post-ETL, keep costs low by disabling autoscaling and minimizing workers.

  • Configurability: You can adjust cluster size dynamically in the UI or with code. Transition from distributed Spark for ETL to single-node processing by controlling cluster size, or by running single-node jobs on the driver only.
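
As a sketch of the setup above, here is roughly what a single-node cluster spec looks like for the Databricks Clusters/Jobs API. The runtime version and node type are placeholder examples to adjust for your workspace; the `singleNode` profile and `SingleNode` resource-class tag are the documented way to get a driver-only cluster.

```python
# Hedged sketch of a single-node (driver-only) cluster spec.
single_node_cluster = {
    "spark_version": "14.3.x-cpu-ml-scala2.12",  # example ML runtime; pick yours
    "node_type_id": "Standard_DS3_v2",           # example driver with ~14 GB RAM
    "num_workers": 0,                            # no workers: everything on the driver
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",              # Spark runs locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

With `num_workers` set to 0 you still get a working local Spark session for light ETL, while scikit-learn training uses the driver's cores directly.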

When to Use Spark MLlib Instead

  • Use MLlib for truly distributed modeling, clustering, or regression on datasets that are too large to fit in driver RAM.

  • Accept the tradeoff: MLlib does not natively support all scikit-learn features but excels at scale.


If you want end-to-end distributed ML on Databricks, consider MLlib or distributed frameworks, but for flexible model prototyping, the pandas conversion remains common—even if it feels redundant after distributed ETL. Your intuition is correct: the major time savings from Databricks often come from scalable preprocessing, not always from training, unless you leverage distributed ML APIs.
