Hey @DylanStout, thanks for laying out the symptoms clearly. This is a classic clash between Safe Spark (shared/high-concurrency) protections and multi-threaded/driver-mutating code paths.
What’s happening
- On clusters with the Shared/Safe Spark access mode (your “autoscale_passthrough” policy likely enforces Table ACLs and Safe Spark), Py4J calls are allowlisted. Calls such as JavaSparkContext.getLocalProperty(...) are blocked unless explicitly whitelisted, which surfaces the py4j.security.Py4JSecurityException you’re seeing (a minimal repro is sketched after this list).
- There have been fixes in DBR that address specific instances of this error, but the underlying Safe Spark policy still applies allowlisting, so non-whitelisted Java calls (including some that certain pyspark code paths rely on) remain blocked in Shared mode.
- Separately, on “autoscale_no_isolation” (No Isolation Shared), you observed a ConcurrentModificationException. That usually means code is mutating shared driver-side state or collections while iterating (or executing multi-threaded operations that modify shared objects), rather than letting Spark do the distribution via tasks. Spark ML algorithms themselves do parallelize across executors; crashes typically appear when user code adds multi-threading or mutable shared objects around them.
- There is a known gap where some pyspark.ml constructors fail under Table ACLs/Safe Spark (for example, VectorAssembler not being whitelisted), so even ML built-ins may be blocked in Shared mode when ACLs are enabled.
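To make the first point concrete, here is a minimal, hedged repro sketch. The property key and the use of the internal _jsc attribute are only illustrative; the point is that the call path hits JavaSparkContext.getLocalProperty, which is not on the allowlist on Shared/Table-ACL clusters:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# On a Shared (Safe Spark / Table ACL) cluster this raises
# py4j.security.Py4JSecurityException because JavaSparkContext.getLocalProperty
# is not whitelisted; on Single User compute it simply returns the value (or None).
# "spark.scheduler.pool" is just an illustrative property key.
sc._jsc.getLocalProperty("spark.scheduler.pool")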
Safe ways to run pyspark.ml in parallel across nodes
If your goal is to safely parallelize ML on Spark, the most reliable route is to avoid Shared/Safe Spark during training and rely on Spark’s own distributed execution (no Python threads, no shared mutable state):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# train_df: your labeled training DataFrame with columns f1, f2, f3, label
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaled", labelCol="label")
pipe = Pipeline(stages=[assembler, scaler, lr])

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1, 0.5])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

cv = (CrossValidator(estimator=pipe,
                     estimatorParamMaps=param_grid,
                     evaluator=evaluator,
                     numFolds=3)
      .setParallelism(8))  # fit up to 8 models concurrently; each fit still runs as distributed Spark jobs

cv_model = cv.fit(train_df)
- Keep transformations and training pure and declarative:
  - No Python multi-threading against Spark objects.
  - No mutation of shared Python/Java collections during iteration.
  - One SparkSession / SparkContext, used from the driver thread.
  - Avoid driver-side shared state; use DataFrames, Broadcasts, and Accumulators carefully, and let Spark schedule tasks (a small broadcast sketch follows this list).
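To illustrate that last point, a small sketch, assuming a lookup dict that might otherwise live as mutable shared driver state (the dict, column names, and UDF are hypothetical):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup: broadcast it once instead of mutating a shared object.
country_names = {"NL": "Netherlands", "US": "United States"}
bc_names = spark.sparkContext.broadcast(country_names)

@F.udf(returnType=StringType())
def to_country_name(code):
    # Executors read their read-only broadcast copy; no driver-side state is mutated.
    return bc_names.value.get(code, "unknown")

df = spark.createDataFrame([("NL",), ("US",)], ["code"])
df.withColumn("country", to_country_name("code")).show()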
If you must stay in Shared/Safe Spark
If policy or governance requires Shared mode/Table ACLs:
- Ensure you’re on a DBR version with recent Safe Spark fixes (e.g., 14.3.2+ and later maintenance releases for some paths). This won’t remove allowlisting, but it can reduce specific failures introduced by regressions.
- Restrict ML usage to APIs that are known to be whitelisted in your runtime. Be aware that some pyspark.ml constructors may still be blocked with ACLs enabled; the documented gap with VectorAssembler under ACLs is one example (a runtime check is sketched after this list).
- Consider a workflow pattern:
  - Use Shared clusters for interactive exploration.
  - Submit training as Jobs to Single User (Assigned) compute, where pyspark.ml is fully functional.
  - Persist features/training sets to Delta; the training job reads them and writes back models and metrics (see the Delta hand-off sketch below).
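For the allowlisting caveat above, a defensive sketch, assuming you want to detect the gap up front rather than fail mid-pipeline (column names are placeholders; the exact exception text varies by runtime, so the generic Py4JError is caught):

from py4j.protocol import Py4JError
from pyspark.ml.feature import VectorAssembler

try:
    # On Shared/Table-ACL clusters a blocked constructor fails right here,
    # with py4j.security.Py4JSecurityException in the trace.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
except Py4JError as exc:
    print(f"VectorAssembler is blocked on this cluster/access mode: {exc}")
    # Fall back: run this step as a Job on Single User (Assigned) compute.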
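And a sketch of the Delta hand-off in that workflow pattern (the table name is a placeholder, features_df is whatever feature DataFrame you build interactively, and cv is the CrossValidator from the example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Exploration notebook on the Shared cluster: persist features to a Delta table.
features_df.write.format("delta").mode("overwrite").saveAsTable("ml.features_train")

# Training job on Single User (Assigned) compute: read the same table and fit there.
train_df = spark.read.table("ml.features_train")
cv_model = cv.fit(train_df)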
Alternative distributed training options
If you’re open to non-Spark ML approaches, Databricks recommends TorchDistributor (PyTorch) or Ray for distributed training when possible; these avoid the Safe Spark ML gaps and are well-supported for parallel scale-out.
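If you go that route, the TorchDistributor entry point looks roughly like this, a minimal sketch assuming DBR ML 13+/Spark 3.4+ and a training function you supply yourself:

from pyspark.ml.torch.distributor import TorchDistributor

def train(learning_rate):
    # Your PyTorch training loop goes here; whatever it returns comes back to the driver.
    return f"trained with lr={learning_rate}"

result = TorchDistributor(
    num_processes=4,    # total worker processes across the cluster
    local_mode=False,   # run on executors rather than only on the driver
    use_gpu=False,
).run(train, 1e-3)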
Practical checklist to resolve your case
- Confirm which access mode each policy actually enforces (Shared/Safe Spark vs. No Isolation Shared vs. Single User).
- If you stay on Shared, verify the DBR version includes the recent Safe Spark fixes and expect some pyspark.ml constructors to remain blocked.
- Remove Python threading and shared mutable driver state from the training code; let CrossValidator’s setParallelism and Spark tasks provide the parallelism.
- For full pyspark.ml functionality, run training as a Job on Single User (Assigned) compute, exchanging features and models via Delta.
Hope this helps, Louis.