Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.


DylanStout
Contributor

Cluster policies not letting us use Pyspark ML tools
Issue details: We have clusters available in our Databricks environment and our plan was to use functions and classes from "pyspark.ml" to process data and train our model in parallel across cores/nodes. However, it looks like we are having trouble with our cluster policies.
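For context, here is a stripped-down sketch of the kind of pipeline we are trying to run. The table name, columns, and the choice of LogisticRegression with CrossValidator are simplified placeholders for our actual workload:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Placeholder feature columns -- our real feature set is wider.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
    parallelism=4,  # fit candidate models concurrently across the cluster
)

df = spark.table("training_data")  # placeholder table name
model = cv.fit(df)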

When I attempt to run my code on a cluster with policy "autoscale_passthrough", I get the following Py4JError:

Py4JError: An error occurred while calling o389.getLocalProperty. Trace:
py4j.security.Py4JSecurityException: Method public java.lang.String org.apache.spark.api.java.JavaSparkContext.getLocalProperty(java.lang.String) is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
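For what it's worth, my understanding (an assumption based on the trace) is that any PySpark call routed through JavaSparkContext.getLocalProperty hits the same block under this policy, so it should be reproducible with something as simple as:

# Assumed minimal reproduction: pyspark.SparkContext.getLocalProperty
# delegates to JavaSparkContext.getLocalProperty, which the
# autoscale_passthrough policy does not appear to whitelist.
spark.sparkContext.getLocalProperty("spark.scheduler.pool")
# expected: py4j.security.Py4JSecurityException (same as above)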


From reading about this error, it appears to be a known limitation of the Py4J security settings on high-concurrency Databricks clusters. When I instead run my code on a cluster with policy "autoscale_no_isolation", the jobs parallelize properly, but the code crashes almost immediately with a java.util.ConcurrentModificationException, which indicates that concurrent tasks are modifying a shared collection while it is being used.

In short, one policy appears too restrictive for pyspark.ml to run at all, while the other lets the jobs run in parallel but exposes a concurrency problem that crashes them almost immediately. Is there anything we can do to safely run our ML code in parallel across multiple nodes?

