Distributed ML on Databricks Serverless

AbhaySingh
Databricks Employee
  • You can now run distributed ML (Spark MLlib in Python, Optuna tuning, MLflow Spark, Joblib Spark) on serverless notebooks/jobs and on standard clusters, not just dedicated ML clusters.
  • It reuses the same Unity Catalog + Lakeguard stack you already use for serverless SQL/ETL, so ML training inherits fine‑grained access control and multi‑user isolation.
  • Sweet spot: teams doing “classic” ML (Spark MLlib, scikit‑learn, XGBoost) that want faster training/tuning without managing special ML clusters.

Why This Matters in Real Life

In most shops today, the story looks like this:

  • Analytics is on serverless + Unity Catalog.
  • Serious ML = “go spin a dedicated cluster,” often with weaker access controls and bespoke configs.
  • Hyperparameter tuning is either single-threaded or a fragile homegrown loop.

The result: cluster sprawl, security headaches, and ML pipelines that live on an island far away from the rest of your lakehouse.

Distributed ML on serverless/standard is Databricks’ attempt to collapse that mess back into one governed platform.

What You Actually Get

On serverless compute (environment version 4+) and standard clusters (DBR 17.0+), you now have:

  • Spark MLlib in PySpark (pyspark.ml) on shared compute – pipelines, tree models, regressions, clustering, etc.
  • Optuna for distributed hyperparameter tuning using MlflowSparkStudy with MLflow‑backed storage.
  • MLflow Spark (mlflow.spark) to log/load Spark ML pipelines as MLflow models.
  • Joblib Spark so existing joblib‑based scikit‑learn/XGBoost workflows can fan out over Spark executors.

All of that runs under Unity Catalog governance and Lakeguard isolation, so shared clusters no longer mean “everyone can see everything.”

What It Looks Like in Practice

One simple pattern

  1. Your features live in Delta tables in Unity Catalog.
  2. You attach a notebook or job to serverless environment version 4+ or a standard cluster on DBR 17.0+.
  3. You train/tune using Spark MLlib or scikit‑learn + Joblib Spark and track everything in MLflow.

Tiny code sketch


from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
import mlflow
import mlflow.spark

# Features live in a Unity Catalog Delta table; "label" is the target column.
df = spark.table("main_ml.credit_features_train")
feature_cols = [c for c in df.columns if c != "label"]

# Assemble the feature columns into one vector, then fit a gradient-boosted tree classifier.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=50)

pipeline = Pipeline(stages=[assembler, gbt])

mlflow.set_experiment("/Shared/distributed-ml/credit-risk")

# Train on the shared compute and log the fitted pipeline as an MLflow model.
with mlflow.start_run(run_name="gbt_serverless"):
    model = pipeline.fit(df)
    mlflow.spark.log_model(model, "model")

Same idea for Optuna: wrap the training logic in an objective() function, wire it into MlflowSparkStudy, and set n_jobs > 1 so trials run across executors.

When This Is a Good Idea

Great fit

  • You’re already on Unity Catalog + serverless or standard compute and want ML to follow the same governance path.
  • Your workloads are mostly “classic” ML (regressions, trees, clustering) or scikit‑learn/XGBoost tuning.
  • You care about multi-user shared clusters where each user only sees the data they should.
  • You want to kill off some bespoke ML clusters and simplify your platform story.

Maybe not (yet)

  • You need unsupported Spark MLlib models (like DistributedLDAModel or FPGrowthModel) – those aren’t supported here today.
  • Your models are multi‑GB monsters that blow past the serverless/standard model size limits (≈100 MB per model on serverless, ≈1 GB on standard).
  • You’re doing very custom deep learning training and already rely on dedicated ML runtimes with TorchDistributor / Ray / DeepSpeed.

Mental Model to Keep You Sane

Think of this feature as:

“Take the compute you already trust for SQL & ETL, and teach it to do distributed ML under the same governance and cost model.”

No new cluster flavor for most use cases, less operational overhead, and a much easier story to tell security and compliance.

Fast Next Steps

  1. Pick one existing Spark MLlib or scikit‑learn training notebook.
  2. Attach it to a serverless env v4 (or DBR 17.0 standard) cluster and make sure it still runs.
  3. Add basic MLflow logging (mlflow.start_run, log params/metrics, and log the model).
  4. Wrap that training in a simple Optuna study for one important model and run a small trial count (e.g. 20).
  5. Once it feels boring (in a good way), start decommissioning any “special” ML clusters that no longer earn their keep.
