<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pyspark ML tools in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</link>
    <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101716"&gt;@DylanStout&lt;/a&gt;, thanks for laying out the symptoms clearly. This is a classic clash between Safe Spark (shared/high-concurrency) protections and multi-threaded, driver-mutating code paths.&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What’s happening&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;On clusters with the &lt;STRONG&gt;Shared/Safe Spark access mode&lt;/STRONG&gt; (your “autoscale_passthrough” policy likely enforces Table ACLs and Safe Spark), Py4J calls are allowlisted. Calls like &lt;CODE&gt;JavaSparkContext.getLocalProperty(...)&lt;/CODE&gt; are blocked unless explicitly whitelisted, which surfaces the &lt;CODE&gt;py4j.security.Py4JSecurityException&lt;/CODE&gt; you’re seeing.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There have been fixes in DBR to reduce specific breakages of this error, but the underlying Safe Spark policy still applies allowlisting, so unwhitelisted Java calls (including some that certain pyspark paths rely on) remain blocked in Shared mode.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Separately, on “autoscale_no_isolation” (No Isolation Shared), you observed a &lt;STRONG&gt;ConcurrentModificationException&lt;/STRONG&gt;. That usually means code is mutating shared driver-side state or collections while iterating (or executing multi-threaded operations that modify shared objects), rather than letting Spark do the distribution via tasks. Spark ML algorithms themselves do parallelize across executors; crashes typically appear when user code adds multi-threading or mutable shared objects around them.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There is a known gap where some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; fail under Table ACLs/Safe Spark (for example, &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; not being whitelisted), so even ML built-ins may be blocked in Shared mode when ACLs are enabled.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Safe ways to run pyspark.ml in parallel across nodes&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If your goal is to safely parallelize ML on Spark, the most reliable route is to avoid Shared/Safe Spark during training and rely on Spark’s own distributed execution (no Python threads, no shared mutable state):&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Prefer &lt;STRONG&gt;Single User (Assigned) access mode&lt;/STRONG&gt; for training jobs. This avoids Safe Spark Py4J allowlisting, unblocks normal pyspark.ml usage, and still gives Spark parallelism across cores and nodes.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Use &lt;STRONG&gt;Spark ML parallelism controls&lt;/STRONG&gt; rather than Python threads:
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;CrossValidator&lt;/CODE&gt; and &lt;CODE&gt;TrainValidationSplit&lt;/CODE&gt; provide distributed evaluation; set &lt;CODE&gt;parallelism&lt;/CODE&gt; to spread hyperparameter evaluation across executors safely.&lt;/LI&gt;
&lt;LI&gt;Example:&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaled", labelCol="label")

pipe = Pipeline(stages=[assembler, scaler, lr])

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1, 0.5])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

cv = (CrossValidator(estimator=pipe,
                     estimatorParamMaps=param_grid,
                     evaluator=evaluator,
                     numFolds=3)
      .setParallelism(8))  # fit/evaluate up to 8 parameter settings in parallel

cv_model = cv.fit(train_df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;UL&gt;
&lt;LI&gt;Keep transformations and training &lt;STRONG&gt;pure and declarative&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;No Python multi-threading against Spark objects.&lt;/LI&gt;
&lt;LI&gt;No mutation of shared Python/Java collections during iteration.&lt;/LI&gt;
&lt;LI&gt;One &lt;CODE&gt;SparkSession&lt;/CODE&gt; / &lt;CODE&gt;SparkContext&lt;/CODE&gt; used from the driver thread.&lt;/LI&gt;
&lt;LI&gt;Avoid driver-side shared state; use DataFrames, Broadcasts, Accumulators carefully, and let Spark schedule tasks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;If you must stay in Shared/Safe Spark&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If policy or governance requires Shared mode/Table ACLs:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Ensure you’re on a &lt;STRONG&gt;DBR version with recent Safe Spark fixes&lt;/STRONG&gt; (e.g., 14.3.2+ and later maintenance for some paths). This won’t remove allowlisting but can reduce specific failures introduced by regressions.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Restrict ML usage to APIs that are known to be whitelisted in your runtime. Be aware that some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; may still be blocked with ACLs enabled; the documented gap with &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; under ACLs is one example.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Consider a workflow pattern:
&lt;UL&gt;
&lt;LI&gt;Use Shared clusters for interactive exploration.&lt;/LI&gt;
&lt;LI&gt;Submit &lt;STRONG&gt;training as Jobs&lt;/STRONG&gt; to &lt;STRONG&gt;Single User&lt;/STRONG&gt; (Assigned) compute where pyspark.ml is fully functional.&lt;/LI&gt;
&lt;LI&gt;Persist features/training sets to Delta; training job reads them and writes back models and metrics.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Alternative distributed training options&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you’re open to non-Spark ML approaches, Databricks recommends &lt;STRONG&gt;TorchDistributor&lt;/STRONG&gt; (PyTorch) or &lt;STRONG&gt;Ray&lt;/STRONG&gt; for distributed training when possible; these avoid the Safe Spark ML gaps and are well-supported for parallel scale-out.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Practical checklist to resolve your case&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Run training on a &lt;STRONG&gt;Single User&lt;/STRONG&gt; cluster with the latest &lt;STRONG&gt;DBR LTS&lt;/STRONG&gt; that your workspace offers, and avoid Python threads; use &lt;CODE&gt;CrossValidator.setParallelism(...)&lt;/CODE&gt; for safe executor-level parallelism.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If Shared mode is mandatory, validate your code paths against the allowlist; expect some pyspark.ml components to be blocked with &lt;STRONG&gt;Table ACLs&lt;/STRONG&gt; and plan job-based training on Single User compute when needed.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Review your code for driver-side mutation:
&lt;UL&gt;
&lt;LI&gt;No modification of Python lists/dicts while iterating.&lt;/LI&gt;
&lt;LI&gt;No shared mutable objects across parallel tasks.&lt;/LI&gt;
&lt;LI&gt;Avoid calling context methods like &lt;CODE&gt;getLocalProperty&lt;/CODE&gt; from Python (they are not on the allowlist in Shared mode and will fail).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
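&lt;DIV class="paragraph"&gt;The first two review items are plain-Python hygiene; a tiny sketch of the anti-pattern and its fix (names are made up):&lt;/DIV&gt;

```python
# Anti-pattern: deleting from a dict while iterating over it raises
# "RuntimeError: dictionary changed size during iteration".
def drop_zeros_unsafe(params):
    for k in params:
        if params[k] == 0:
            del params[k]   # mutates the dict mid-iteration
    return params

# Safe: build a new dict (or iterate over list(params)) instead of mutating.
def drop_zeros_safe(params):
    return {k: v for k, v in params.items() if v != 0}
```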
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
    <pubDate>Wed, 29 Oct 2025 10:20:45 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-10-29T10:20:45Z</dc:date>
    <item>
      <title>Pyspark ML tools</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/113807#M44643</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Cluster policies not letting us use Pyspark ML tools&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Issue details: We have clusters available in our Databricks environment and our plan was to use functions and classes from "pyspark.ml" to process data and train our model in parallel across cores/nodes. However, it looks like we are having trouble with our cluster policies.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;When I attempt to run my code on a cluster with policy "autoscale_passthrough", I get the following Py4JError:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Py4JError: An error occurred while calling o389.getLocalProperty. Trace:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;py4j.security.Py4JSecurityException: Method public java.lang.String org.apache.spark.api.java.JavaSparkContext.getLocalProperty(java.lang.String) is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;From reading about this error, it seems to be a known issue with security settings on high-concurrency Databricks clusters. When I attempt to run my code on a cluster with policy "autoscale_no_isolation", the jobs will parallelize properly, but the code will almost immediately crash with a java.util.ConcurrentModificationException, meaning that two or more jobs tried to modify the same data.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Basically, it seems like one policy is overly restrictive and won't let us use multiple cores, while the other policy is too open and makes our code non-thread-safe so it crashes constantly. Is there anything we can do to safely run our ML code in parallel across multiple nodes?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2025 13:48:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/113807#M44643</guid>
      <dc:creator>DylanStout</dc:creator>
      <dc:date>2025-03-27T13:48:33Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark ML tools</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</link>
      <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101716"&gt;@DylanStout&lt;/a&gt;, thanks for laying out the symptoms clearly. This is a classic clash between Safe Spark (shared/high-concurrency) protections and multi-threaded, driver-mutating code paths.&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What’s happening&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;On clusters with the &lt;STRONG&gt;Shared/Safe Spark access mode&lt;/STRONG&gt; (your “autoscale_passthrough” policy likely enforces Table ACLs and Safe Spark), Py4J calls are allowlisted. Calls like &lt;CODE&gt;JavaSparkContext.getLocalProperty(...)&lt;/CODE&gt; are blocked unless explicitly whitelisted, which surfaces the &lt;CODE&gt;py4j.security.Py4JSecurityException&lt;/CODE&gt; you’re seeing.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There have been fixes in DBR to reduce specific breakages of this error, but the underlying Safe Spark policy still applies allowlisting, so unwhitelisted Java calls (including some that certain pyspark paths rely on) remain blocked in Shared mode.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Separately, on “autoscale_no_isolation” (No Isolation Shared), you observed a &lt;STRONG&gt;ConcurrentModificationException&lt;/STRONG&gt;. That usually means code is mutating shared driver-side state or collections while iterating (or executing multi-threaded operations that modify shared objects), rather than letting Spark do the distribution via tasks. Spark ML algorithms themselves do parallelize across executors; crashes typically appear when user code adds multi-threading or mutable shared objects around them.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;There is a known gap where some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; fail under Table ACLs/Safe Spark (for example, &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; not being whitelisted), so even ML built-ins may be blocked in Shared mode when ACLs are enabled.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Safe ways to run pyspark.ml in parallel across nodes&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If your goal is to safely parallelize ML on Spark, the most reliable route is to avoid Shared/Safe Spark during training and rely on Spark’s own distributed execution (no Python threads, no shared mutable state):&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Prefer &lt;STRONG&gt;Single User (Assigned) access mode&lt;/STRONG&gt; for training jobs. This avoids Safe Spark Py4J allowlisting, unblocks normal pyspark.ml usage, and still gives Spark parallelism across cores and nodes.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Use &lt;STRONG&gt;Spark ML parallelism controls&lt;/STRONG&gt; rather than Python threads:
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;CrossValidator&lt;/CODE&gt; and &lt;CODE&gt;TrainValidationSplit&lt;/CODE&gt; provide distributed evaluation; set &lt;CODE&gt;parallelism&lt;/CODE&gt; to spread hyperparameter evaluation across executors safely.&lt;/LI&gt;
&lt;LI&gt;Example:&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="scaled", labelCol="label")

pipe = Pipeline(stages=[assembler, scaler, lr])

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1, 0.5])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

cv = (CrossValidator(estimator=pipe,
                     estimatorParamMaps=param_grid,
                     evaluator=evaluator,
                     numFolds=3)
      .setParallelism(8))  # fit/evaluate up to 8 parameter settings in parallel

cv_model = cv.fit(train_df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;UL&gt;
&lt;LI&gt;Keep transformations and training &lt;STRONG&gt;pure and declarative&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;No Python multi-threading against Spark objects.&lt;/LI&gt;
&lt;LI&gt;No mutation of shared Python/Java collections during iteration.&lt;/LI&gt;
&lt;LI&gt;One &lt;CODE&gt;SparkSession&lt;/CODE&gt; / &lt;CODE&gt;SparkContext&lt;/CODE&gt; used from the driver thread.&lt;/LI&gt;
&lt;LI&gt;Avoid driver-side shared state; use DataFrames, Broadcasts, Accumulators carefully, and let Spark schedule tasks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;If you must stay in Shared/Safe Spark&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If policy or governance requires Shared mode/Table ACLs:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Ensure you’re on a &lt;STRONG&gt;DBR version with recent Safe Spark fixes&lt;/STRONG&gt; (e.g., 14.3.2+ and later maintenance for some paths). This won’t remove allowlisting but can reduce specific failures introduced by regressions.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Restrict ML usage to APIs that are known to be whitelisted in your runtime. Be aware that some &lt;STRONG&gt;pyspark.ml constructors&lt;/STRONG&gt; may still be blocked with ACLs enabled; the documented gap with &lt;CODE&gt;VectorAssembler&lt;/CODE&gt; under ACLs is one example.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Consider a workflow pattern:
&lt;UL&gt;
&lt;LI&gt;Use Shared clusters for interactive exploration.&lt;/LI&gt;
&lt;LI&gt;Submit &lt;STRONG&gt;training as Jobs&lt;/STRONG&gt; to &lt;STRONG&gt;Single User&lt;/STRONG&gt; (Assigned) compute where pyspark.ml is fully functional.&lt;/LI&gt;
&lt;LI&gt;Persist features/training sets to Delta; training job reads them and writes back models and metrics.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="paragraph"&gt;Alternative distributed training options&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you’re open to non-Spark ML approaches, Databricks recommends &lt;STRONG&gt;TorchDistributor&lt;/STRONG&gt; (PyTorch) or &lt;STRONG&gt;Ray&lt;/STRONG&gt; for distributed training when possible; these avoid the Safe Spark ML gaps and are well-supported for parallel scale-out.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Practical checklist to resolve your case&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Run training on a &lt;STRONG&gt;Single User&lt;/STRONG&gt; cluster with the latest &lt;STRONG&gt;DBR LTS&lt;/STRONG&gt; that your workspace offers, and avoid Python threads; use &lt;CODE&gt;CrossValidator.setParallelism(...)&lt;/CODE&gt; for safe executor-level parallelism.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If Shared mode is mandatory, validate your code paths against the allowlist; expect some pyspark.ml components to be blocked with &lt;STRONG&gt;Table ACLs&lt;/STRONG&gt; and plan job-based training on Single User compute when needed.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Review your code for driver-side mutation:
&lt;UL&gt;
&lt;LI&gt;No modification of Python lists/dicts while iterating.&lt;/LI&gt;
&lt;LI&gt;No shared mutable objects across parallel tasks.&lt;/LI&gt;
&lt;LI&gt;Avoid calling context methods like &lt;CODE&gt;getLocalProperty&lt;/CODE&gt; from Python (they are not on the allowlist in Shared mode and will fail).&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
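&lt;DIV class="paragraph"&gt;The first two review items are plain-Python hygiene; a tiny sketch of the anti-pattern and its fix (names are made up):&lt;/DIV&gt;

```python
# Anti-pattern: deleting from a dict while iterating over it raises
# "RuntimeError: dictionary changed size during iteration".
def drop_zeros_unsafe(params):
    for k in params:
        if params[k] == 0:
            del params[k]   # mutates the dict mid-iteration
    return params

# Safe: build a new dict (or iterate over list(params)) instead of mutating.
def drop_zeros_safe(params):
    return {k: v for k, v in params.items() if v != 0}
```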
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Wed, 29 Oct 2025 10:20:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-ml-tools/m-p/136528#M50588</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-29T10:20:45Z</dc:date>
    </item>
  </channel>
</rss>

