You are noticing a common pattern in Databricks ML tutorials: data is often processed with Spark for scalability, but training and modeling are frequently done on pandas DataFrames using single-node libraries like scikit-learn. This workflow can be confusing for users expecting end-to-end distributed compute. Here’s why this happens and how to think about your setup:
Why Databricks Tutorials Use .toPandas()
- Library Support: Most Python ML libraries (scikit-learn, XGBoost, LightGBM) expect a pandas DataFrame and run on a single machine. Spark MLlib is distributed, but does not match the breadth and maturity of the pandas-based ecosystem, particularly for custom modeling or advanced workflows.
- Workflow Scalability: Spark excels at distributed data preparation and feature engineering when datasets exceed RAM or local disk. Once the data is prepared, it is often small enough to fit in memory, so tutorials favor converting to pandas for flexibility and speed in model training.
- Interoperability: Converting to pandas lets you leverage rich visualization, analysis, and ML libraries that aren't Spark-native. For most projects, the distributed-compute bottleneck is in ETL, not in fitting models, unless you are doing deep learning or working with truly huge datasets. A minimal sketch of this prepare-then-convert pattern follows below.
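For illustration, here is a minimal sketch of the prepare-in-Spark, train-in-pandas pattern, assuming a Databricks notebook where spark is already defined; the table and column names (sales_features, feature_1, feature_2, label) are hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Distributed feature engineering in Spark (table/column names are placeholders)
features_sdf = (
    spark.table("sales_features")
         .filter("label IS NOT NULL")
         .select("feature_1", "feature_2", "label")
)

# The prepared result is now small, so collect it onto the driver as pandas
features_pdf = features_sdf.toPandas()

# Single-node training with scikit-learn on the driver
X = features_pdf[["feature_1", "feature_2"]]
y = features_pdf["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```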
Do You Need Spark for ML Training?
- If your full training dataset fits comfortably in RAM and your model trains in minutes on one machine, distributed training may be overkill.
- For larger problems, such as deep learning or massive tabular datasets, Databricks supports distributed ML frameworks (MLlib, HorovodRunner, distributed XGBoost) that avoid the .toPandas() step, but these often require different APIs and more configuration; see the sketch after this list.
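As a rough sketch of what skipping .toPandas() can look like, here is distributed XGBoost training through the xgboost.spark integration (bundled with recent Databricks ML runtimes, otherwise requiring xgboost >= 1.7); the table, column names, and worker count are placeholders, not a definitive recipe:

```python
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # assumes xgboost >= 1.7 on the cluster

# Assemble hypothetical feature columns into a single vector column
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
train_sdf = assembler.transform(spark.table("sales_features"))

# Training runs across the Spark workers; the data never has to fit on the driver
xgb = SparkXGBClassifier(features_col="features", label_col="label", num_workers=4)
xgb_model = xgb.fit(train_sdf)
predictions = xgb_model.transform(train_sdf)
```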
Designing the Driver/Worker Setup for Single-node ML
If you use scikit-learn or similar libraries and do most work locally, the best Databricks cluster setup is:
- Right-sized driver node: Choose a driver with enough RAM and CPU for the final pandas DataFrame and model training. Often a Standard_DS3_v2 (4 vCPUs, 14 GB RAM) or Standard_D4_v2 (8 vCPUs, 28 GB RAM) instance is enough if your data and model fit comfortably in that memory.
- Minimal workers: Set the worker count to zero or one unless you need distributed Spark for earlier data prep. If only single-node ML is needed post-ETL, keep costs low by disabling autoscaling and minimizing workers.
- Configurability: You can adjust cluster size dynamically in the UI or with code. Transition from distributed Spark for ETL to single-node processing by resizing the cluster, or by running single-node jobs on the driver only; a hedged cluster-spec example follows this list.
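As one hedged example, a driver-only cluster can be described with a spec like the one below and created through the Clusters REST API (POST /api/2.0/clusters/create), the Databricks SDK, or equivalently in the UI; the cluster name, node type, and runtime version are illustrative assumptions:

```python
# Sketch of a single-node ("driver only") cluster spec for the Databricks Clusters API.
# Values such as spark_version and node_type_id are assumptions; adjust to your workspace.
single_node_cluster_spec = {
    "cluster_name": "single-node-ml",
    "spark_version": "14.3.x-cpu-ml-scala2.12",   # assumed ML runtime version
    "node_type_id": "Standard_DS3_v2",            # driver sized for your pandas data
    "num_workers": 0,                              # no workers: everything runs on the driver
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```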
When to Use Spark MLlib Instead
- Use MLlib for truly distributed modeling, clustering, or regression on datasets that are too large to fit in driver RAM.
- Accept the tradeoff: MLlib does not natively support every scikit-learn feature, but it excels at scale; a short pipeline sketch follows below.
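For comparison, here is a minimal MLlib pipeline sketch in which training stays distributed; table and column names are again hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble hypothetical feature columns and fit a distributed logistic regression
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(spark.table("sales_features"))
predictions = model.transform(spark.table("sales_features"))
```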
If you want end-to-end distributed ML on Databricks, consider MLlib or other distributed frameworks; for flexible model prototyping, the pandas conversion remains common, even if it feels redundant after distributed ETL. Your intuition is correct: the major time savings on Databricks usually come from scalable preprocessing rather than from training, unless you adopt distributed ML APIs.