<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pyspark ML tools in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</link>
    <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101716"&gt;@DylanStout&lt;/a&gt;, thanks for laying out the symptoms clearly. This is a classic clash between Safe Spark (shared/high-concurrency) protections and multi-threaded, driver-mutating code paths.&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What’s happening&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;On clusters with the &lt;STRONG&gt;Shared/Safe Spark access mode&lt;/STRONG&gt; (your “autoscale_passthrough” policy likely enforces Table ACLs and Safe Spark), Py4J calls are allowlisted. Calls like &lt;CODE&gt;JavaSparkContext.getLocalProperty(...)&lt;/CODE&gt; are blocked unless explicitly whitelisted, which surfaces the &lt;CODE&gt;py4j.security.Py4JSecurityException&lt;/CODE&gt; you’re seeing.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There have been fixes in DBR to reduce specific breakages of this error, but the underlying Safe Spark policy still applies allowlisting, so unwhitelisted Java calls (including some that certain pyspark paths rely on) remain blocked in Shared mode.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Separately, on “autoscale_no_isolation” (No Isolation Shared), you observed a &lt;STRONG&gt;ConcurrentModificationException&lt;/STRONG&gt;. That usually means code is mutating shared driver-side state or collections while iterating (or executing multi-threaded operations that modify shared objects), rather than letting Spark do the distribution via tasks. Spark ML algorithms themselves do parallelize across executors; crashes typically appear when user code adds multi-threading or mutable shared objects around them.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There is a known gap where some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; fail under Table ACLs/Safe Spark (for example, &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; not being whitelisted), so even ML built-ins may be blocked in Shared mode when ACLs are enabled.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Safe ways to run pyspark.ml in parallel across nodes&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If your goal is to safely parallelize ML on Spark, the most reliable route is to avoid Shared/Safe Spark during training and rely on Spark’s own distributed execution (no Python threads, no shared mutable state):&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Prefer &lt;STRONG&gt;Single User (Assigned) access mode&lt;/STRONG&gt; for training jobs. This avoids Safe Spark Py4J allowlisting, unblocks normal pyspark.ml usage, and still gives Spark parallelism across cores and nodes.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Use &lt;STRONG&gt;Spark ML parallelism controls&lt;/STRONG&gt; rather than Python threads:
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;CrossValidator&lt;/CODE&gt; and &lt;CODE&gt;TrainValidationSplit&lt;/CODE&gt; provide distributed evaluation; set &lt;CODE&gt;parallelism&lt;/CODE&gt; to spread hyperparameter evaluation across executors safely.&lt;/LI&gt;
&lt;LI&gt;Example:&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaled", labelCol="label")

pipe = Pipeline(stages=[assembler, scaler, lr])

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1, 0.5])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

cv = (CrossValidator(estimator=pipe,
                     estimatorParamMaps=param_grid,
                     evaluator=evaluator,
                     numFolds=3)
      .setParallelism(8))  # fit/evaluate up to 8 parameter settings in parallel

cv_model = cv.fit(train_df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;UL&gt;
&lt;LI&gt;Keep transformations and training &lt;STRONG&gt;pure and declarative&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;No Python multi-threading against Spark objects.&lt;/LI&gt;
&lt;LI&gt;No mutation of shared Python/Java collections during iteration.&lt;/LI&gt;
&lt;LI&gt;One &lt;CODE&gt;SparkSession&lt;/CODE&gt; / &lt;CODE&gt;SparkContext&lt;/CODE&gt; used from the driver thread.&lt;/LI&gt;
&lt;LI&gt;Avoid driver-side shared state; use DataFrames, Broadcasts, Accumulators carefully, and let Spark schedule tasks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;If you must stay in Shared/Safe Spark&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If policy or governance requires Shared mode/Table ACLs:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Ensure you’re on a &lt;STRONG&gt;DBR version with recent Safe Spark fixes&lt;/STRONG&gt; (e.g., 14.3.2+ and later maintenance for some paths). This won’t remove allowlisting but can reduce specific failures introduced by regressions.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Restrict ML usage to APIs that are known to be whitelisted in your runtime. Be aware that some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; may still be blocked with ACLs enabled; the documented gap with &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; under ACLs is one example.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Consider a workflow pattern:
&lt;UL&gt;
&lt;LI&gt;Use Shared clusters for interactive exploration.&lt;/LI&gt;
&lt;LI&gt;Submit &lt;STRONG&gt;training as Jobs&lt;/STRONG&gt; to &lt;STRONG&gt;Single User&lt;/STRONG&gt; (Assigned) compute where pyspark.ml is fully functional.&lt;/LI&gt;
&lt;LI&gt;Persist features/training sets to Delta; training job reads them and writes back models and metrics.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Alternative distributed training options&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you’re open to non-Spark ML approaches, Databricks recommends &lt;STRONG&gt;TorchDistributor&lt;/STRONG&gt; (PyTorch) or &lt;STRONG&gt;Ray&lt;/STRONG&gt; for distributed training when possible; these avoid the Safe Spark ML gaps and are well-supported for parallel scale-out.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Practical checklist to resolve your case&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Run training on a &lt;STRONG&gt;Single User&lt;/STRONG&gt; cluster with the latest &lt;STRONG&gt;DBR LTS&lt;/STRONG&gt; that your workspace offers, and avoid Python threads; use &lt;CODE&gt;CrossValidator.setParallelism(...)&lt;/CODE&gt; for safe executor-level parallelism.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If Shared mode is mandatory, validate your code paths against the allowlist; expect some pyspark.ml components to be blocked with &lt;STRONG&gt;Table ACLs&lt;/STRONG&gt; and plan job-based training on Single User compute when needed.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Review your code for driver-side mutation:
&lt;UL&gt;
&lt;LI&gt;No modification of Python lists/dicts while iterating.&lt;/LI&gt;
&lt;LI&gt;No shared mutable objects across parallel tasks.&lt;/LI&gt;
&lt;LI&gt;Avoid calling context methods like &lt;CODE&gt;getLocalProperty&lt;/CODE&gt; from Python (they are not on the allowlist in Shared mode and will fail).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
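&lt;DIV class="paragraph"&gt;The first two review items are plain-Python hygiene; a tiny sketch of the anti-pattern and its fix (names are made up):&lt;/DIV&gt;

```python
# Anti-pattern: deleting from a dict while iterating over it raises
# "RuntimeError: dictionary changed size during iteration".
def drop_zeros_unsafe(params):
    for k in params:
        if params[k] == 0:
            del params[k]   # mutates the dict mid-iteration
    return params

# Safe: build a new dict (or iterate over list(params)) instead of mutating.
def drop_zeros_safe(params):
    return {k: v for k, v in params.items() if v != 0}
```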
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
    <pubDate>Wed, 29 Oct 2025 10:20:45 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-10-29T10:20:45Z</dc:date>
    <item>
      <title>Pyspark ML tools</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/113807#M44643</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Cluster policies not letting us use Pyspark ML tools&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Issue details: We have clusters available in our Databricks environment and our plan was to use functions and classes from "pyspark.ml" to process data and train our model in parallel across cores/nodes. However, it looks like we are having trouble with our cluster policies.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;When I attempt to run my code on a cluster with policy "autoscale_passthrough", I get the following Py4JError:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Py4JError: An error occurred while calling o389.getLocalProperty. Trace:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;py4j.security.Py4JSecurityException: Method public java.lang.String org.apache.spark.api.java.JavaSparkContext.getLocalProperty(java.lang.String) is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;From reading about this error, it seems to be a known issue with security settings on high-concurrency Databricks clusters. When I attempt to run my code on a cluster with policy "autoscale_no_isolation", the jobs will parallelize properly, but the code will almost immediately crash with a java.util.ConcurrentModificationException, meaning that two or more jobs tried to modify the same data.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Basically, it seems like one policy is overly restrictive and won't let us use multiple cores, while the other policy is too open and makes our code non-thread-safe so it crashes constantly. Is there anything we can do to safely run our ML code in parallel across multiple nodes?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2025 13:48:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/113807#M44643</guid>
      <dc:creator>DylanStout</dc:creator>
      <dc:date>2025-03-27T13:48:33Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark ML tools</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</link>
      <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101716"&gt;@DylanStout&lt;/a&gt;, thanks for laying out the symptoms clearly. This is a classic clash between Safe Spark (shared/high-concurrency) protections and multi-threaded, driver-mutating code paths.&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What’s happening&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;On clusters with the &lt;STRONG&gt;Shared/Safe Spark access mode&lt;/STRONG&gt; (your “autoscale_passthrough” policy likely enforces Table ACLs and Safe Spark), Py4J calls are allowlisted. Calls like &lt;CODE&gt;JavaSparkContext.getLocalProperty(...)&lt;/CODE&gt; are blocked unless explicitly whitelisted, which surfaces the &lt;CODE&gt;py4j.security.Py4JSecurityException&lt;/CODE&gt; you’re seeing.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There have been fixes in DBR to reduce specific breakages of this error, but the underlying Safe Spark policy still applies allowlisting, so unwhitelisted Java calls (including some that certain pyspark paths rely on) remain blocked in Shared mode.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Separately, on “autoscale_no_isolation” (No Isolation Shared), you observed a &lt;STRONG&gt;ConcurrentModificationException&lt;/STRONG&gt;. That usually means code is mutating shared driver-side state or collections while iterating (or executing multi-threaded operations that modify shared objects), rather than letting Spark do the distribution via tasks. Spark ML algorithms themselves do parallelize across executors; crashes typically appear when user code adds multi-threading or mutable shared objects around them.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There is a known gap where some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; fail under Table ACLs/Safe Spark (for example, &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; not being whitelisted), so even ML built-ins may be blocked in Shared mode when ACLs are enabled.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Safe ways to run pyspark.ml in parallel across nodes&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If your goal is to safely parallelize ML on Spark, the most reliable route is to avoid Shared/Safe Spark during training and rely on Spark’s own distributed execution (no Python threads, no shared mutable state):&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Prefer &lt;STRONG&gt;Single User (Assigned) access mode&lt;/STRONG&gt; for training jobs. This avoids Safe Spark Py4J allowlisting, unblocks normal pyspark.ml usage, and still gives Spark parallelism across cores and nodes.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Use &lt;STRONG&gt;Spark ML parallelism controls&lt;/STRONG&gt; rather than Python threads:
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;CrossValidator&lt;/CODE&gt; and &lt;CODE&gt;TrainValidationSplit&lt;/CODE&gt; provide distributed evaluation; set &lt;CODE&gt;parallelism&lt;/CODE&gt; to spread hyperparameter evaluation across executors safely.&lt;/LI&gt;
&lt;LI&gt;Example:&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaled", labelCol="label")

pipe = Pipeline(stages=[assembler, scaler, lr])

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1, 0.5])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

cv = (CrossValidator(estimator=pipe,
                     estimatorParamMaps=param_grid,
                     evaluator=evaluator,
                     numFolds=3)
      .setParallelism(8))  # fit/evaluate up to 8 parameter settings in parallel

cv_model = cv.fit(train_df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;UL&gt;
&lt;LI&gt;Keep transformations and training &lt;STRONG&gt;pure and declarative&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;No Python multi-threading against Spark objects.&lt;/LI&gt;
&lt;LI&gt;No mutation of shared Python/Java collections during iteration.&lt;/LI&gt;
&lt;LI&gt;One &lt;CODE&gt;SparkSession&lt;/CODE&gt; / &lt;CODE&gt;SparkContext&lt;/CODE&gt; used from the driver thread.&lt;/LI&gt;
&lt;LI&gt;Avoid driver-side shared state; use DataFrames, Broadcasts, Accumulators carefully, and let Spark schedule tasks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;If you must stay in Shared/Safe Spark&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If policy or governance requires Shared mode/Table ACLs:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Ensure you’re on a &lt;STRONG&gt;DBR version with recent Safe Spark fixes&lt;/STRONG&gt; (e.g., 14.3.2+ and later maintenance for some paths). This won’t remove allowlisting but can reduce specific failures introduced by regressions.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Restrict ML usage to APIs that are known to be whitelisted in your runtime. Be aware that some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; may still be blocked with ACLs enabled; the documented gap with &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; under ACLs is one example.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Consider a workflow pattern:
&lt;UL&gt;
&lt;LI&gt;Use Shared clusters for interactive exploration.&lt;/LI&gt;
&lt;LI&gt;Submit &lt;STRONG&gt;training as Jobs&lt;/STRONG&gt; to &lt;STRONG&gt;Single User&lt;/STRONG&gt; (Assigned) compute where pyspark.ml is fully functional.&lt;/LI&gt;
&lt;LI&gt;Persist features/training sets to Delta; training job reads them and writes back models and metrics.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Alternative distributed training options&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you’re open to non-Spark ML approaches, Databricks recommends &lt;STRONG&gt;TorchDistributor&lt;/STRONG&gt; (PyTorch) or &lt;STRONG&gt;Ray&lt;/STRONG&gt; for distributed training when possible; these avoid the Safe Spark ML gaps and are well-supported for parallel scale-out.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Practical checklist to resolve your case&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Run training on a &lt;STRONG&gt;Single User&lt;/STRONG&gt; cluster with the latest &lt;STRONG&gt;DBR LTS&lt;/STRONG&gt; that your workspace offers, and avoid Python threads; use &lt;CODE&gt;CrossValidator.setParallelism(...)&lt;/CODE&gt; for safe executor-level parallelism.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If Shared mode is mandatory, validate your code paths against the allowlist; expect some pyspark.ml components to be blocked with &lt;STRONG&gt;Table ACLs&lt;/STRONG&gt; and plan job-based training on Single User compute when needed.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Review your code for driver-side mutation:
&lt;UL&gt;
&lt;LI&gt;No modification of Python lists/dicts while iterating.&lt;/LI&gt;
&lt;LI&gt;No shared mutable objects across parallel tasks.&lt;/LI&gt;
&lt;LI&gt;Avoid calling context methods like &lt;CODE&gt;getLocalProperty&lt;/CODE&gt; from Python (they are not on the allowlist in Shared mode and will fail).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
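&lt;DIV class="paragraph"&gt;The first two review items are plain-Python hygiene; a tiny sketch of the anti-pattern and its fix (names are made up):&lt;/DIV&gt;

```python
# Anti-pattern: deleting from a dict while iterating over it raises
# "RuntimeError: dictionary changed size during iteration".
def drop_zeros_unsafe(params):
    for k in params:
        if params[k] == 0:
            del params[k]   # mutates the dict mid-iteration
    return params

# Safe: build a new dict (or iterate over list(params)) instead of mutating.
def drop_zeros_safe(params):
    return {k: v for k, v in params.items() if v != 0}
```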
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Wed, 29 Oct 2025 10:20:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-29T10:20:45Z</dc:date>
    </item>
  </channel>
</rss>

