Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.


DylanStout
Contributor

Cluster policies not letting us use Pyspark ML tools
Issue details: We have clusters available in our Databricks environment and our plan was to use functions and classes from "pyspark.ml" to process data and train our model in parallel across cores/nodes. However, it looks like we are having trouble with our cluster policies.
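For context, here is a stripped-down sketch of the kind of pipeline we are trying to run. The table name, columns, and the choice of LogisticRegression with CrossValidator are simplified placeholders for our actual workload:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Placeholder feature columns -- our real feature set is wider.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
    parallelism=4,  # fit candidate models concurrently across the cluster
)

df = spark.table("training_data")  # placeholder table name
model = cv.fit(df)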

When I attempt to run my code on a cluster with policy "autoscale_passthrough", I get the following Py4JError:

Py4JError: An error occurred while calling o389.getLocalProperty. Trace:
py4j.security.Py4JSecurityException: Method public java.lang.String org.apache.spark.api.java.JavaSparkContext.getLocalProperty(java.lang.String) is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
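For what it's worth, my understanding (an assumption based on the trace) is that any PySpark call routed through JavaSparkContext.getLocalProperty hits the same block under this policy, so it should be reproducible with something as simple as:

# Assumed minimal reproduction: pyspark.SparkContext.getLocalProperty
# delegates to JavaSparkContext.getLocalProperty, which the
# autoscale_passthrough policy does not appear to whitelist.
spark.sparkContext.getLocalProperty("spark.scheduler.pool")
# expected: py4j.security.Py4JSecurityException (same as above)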


From reading about this error, it appears to be a known limitation of the Py4J security settings on high-concurrency Databricks clusters. When I instead run my code on a cluster with policy "autoscale_no_isolation", the jobs parallelize properly, but the code crashes almost immediately with a java.util.ConcurrentModificationException, which indicates that concurrent tasks are modifying a shared collection while it is being used.

In short, one policy appears too restrictive for pyspark.ml to run at all, while the other lets the jobs run in parallel but exposes a concurrency problem that crashes them almost immediately. Is there anything we can do to safely run our ML code in parallel across multiple nodes?

