Cannot use RDD and cannot set "spark.databricks.pyspark.enablePy4JSecurity false" for cluster

Christine
Contributor

I have been using "rdd.flatMap(lambda x: x)" for a while to create lists from columns. However, after I changed the cluster to Shared access mode (to use Unity Catalog), I get the following error:

py4j.security.Py4JSecurityException: Method public org.apache.spark.rdd.RDD org.apache.spark.api.java.JavaRDD.rdd() is not whitelisted on class class org.apache.spark.api.java.JavaRDD

I have tried to solve the error by adding:

"spark.databricks.pyspark.enablePy4JSecurity false"

however I then get the following error:

"spark.databricks.pyspark.enablePy4JSecurity is not allowed when chossing an access mode"

Does anybody know how to use RDDs when using a cluster with Unity Catalog?

Thank you!

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Christine Pedersen: Would you like to start migrating to DataFrames? The DataFrame API is a more modern and optimized way to work with structured data in Spark.

The error you are encountering is related to Py4J security settings in Apache Spark. In Shared access mode, Py4J security is enabled by default for security reasons, which restricts certain methods from being called on the Spark RDD object.
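For instance, a minimal sketch of building a list of distinct column values without touching the RDD API (assuming a DataFrame df with a column named col_name; both names are placeholders):

# collect() works on the DataFrame directly; each element is a Row,
# so index into it to get the plain value
value_list = [row[0] for row in df.select("col_name").distinct().collect()]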


12 REPLIES


Christine
Contributor

Hi @Suteja Kanuri,

In this case I am using a PySpark DataFrame, but I am trying to get all the values from a column in that DataFrame and create a list. I am using this list to filter rows in another DataFrame (see example below):

value_list = pysparkDF.select(<column_name>).distinct().rdd.flatMap(lambda x: x).collect()

filtered_table = DF2.filter(DF2.<column_name>.isin(value_list))

But I will try to look for ways to avoid lists and keep everything in DataFrame format.

Nithya_r
New Contributor II

I get the same error when using the repartition command on a shared cluster; it works fine with a single user cluster. Is there an alternative for that? And are there any issues with continuing to use a single user cluster?

Anonymous
Not applicable

@Christine Pedersen:

You can achieve this without collecting data into a list using Spark's built-in DataFrame operations.

You can use the join operation to filter DF2 based on the distinct values in the column from pysparkDF. Here's an example:

filtered_table = DF2.join(
    pysparkDF.select("<column_name>").distinct(),
    on="<column_name>",
    how="inner"
)

This code performs an inner join between DF2 and the distinct values of <column_name> from pysparkDF, which effectively filters DF2 to rows whose value appears in that column. Note that this approach returns a new DataFrame rather than a list, which should be more efficient for larger datasets.
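A left semi join is worth considering here as well; it matches the isin semantics exactly, returns only DF2's columns, never duplicates rows, and makes the distinct() step unnecessary (same placeholder column name as above):

# A semi join keeps DF2 rows with a match on <column_name> in pysparkDF,
# without pulling any columns from the right side
filtered_table = DF2.join(
    pysparkDF.select("<column_name>"),
    on="<column_name>",
    how="left_semi"
)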

Shivanshu_
New Contributor III

@Suteja Kanuri

Let me know how I can do this: I have to run rdd.map on a column containing JSON data, and then read it as JSON in PySpark. How can I do that?

Sample syntax for what I'm trying to achieve on a shared cluster, with the same error related to "spark.databricks.pyspark.enablePy4JSecurity":

Syntax: spark.read.json(df.rdd.map(lambda x: x[0]))

What would be the optimal alternative for this?
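One RDD-free alternative to try, assuming the JSON strings sit in a column named json_col (a placeholder) and you can supply a schema, is from_json, which parses the column without ever touching .rdd; a sketch:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema -- replace with the actual structure of the JSON documents
json_schema = StructType([StructField("field1", StringType())])

# Parse the JSON string column directly; no .rdd call involved
parsed = df.withColumn("parsed", F.from_json(F.col("json_col"), json_schema))
result = parsed.select("parsed.*")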

Barmat
New Contributor II

Hi,

I have the exact same issue as @Shivanshu_. Any help would be highly appreciated.

Haiyangl104
New Contributor III

Try this:

# Change column_name and table to the actual column and table names:
placeholder_list = spark.sql("select column_name from table").collect()
desired_list = [row.column_name for row in placeholder_list]
print(desired_list)

Sumit_Kumar
New Contributor III

Try setting the configuration below in a Databricks notebook, then retry. It should work.

spark.conf.set("spark.jvm.class.allowlist", "spark.databricks.pyspark.enablePy4JSecurity")

283513
New Contributor II

This configuration does not work for me. Please suggest any other solution. I do need to use rdd.mapPartitions on a DataFrame created from Unity Catalog data:

df_unity_catalog.rdd.mapPartitions(an_function)
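One DataFrame-native alternative worth checking is mapInPandas, which also processes rows in batches but stays on the DataFrame API, so it avoids the restricted RDD calls. A sketch, assuming the logic can be rewritten to take pandas DataFrames and that the output schema matches the input (the function name process is a placeholder):

# mapInPandas calls the function with an iterator of pandas DataFrames
# (one per batch) and expects it to yield pandas DataFrames back
def process(batches):
    for pdf in batches:
        # ... per-batch logic goes here, replacing the mapPartitions body ...
        yield pdf

result = df_unity_catalog.mapInPandas(process, schema=df_unity_catalog.schema)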

sue01
New Contributor II

Hey @283513, were you able to solve this? I am facing the same issue when using VectorAssembler with a Unity Catalog cluster.
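In case it helps while you look for a fix: one way to assemble a features column without VectorAssembler is array_to_vector from pyspark.ml.functions (Spark 3.1+) over an array of numeric columns. A sketch with placeholder column names c1 and c2; I have not verified this on a shared access mode cluster:

from pyspark.sql import functions as F
from pyspark.ml.functions import array_to_vector

# Build a dense features vector from numeric columns c1 and c2 (placeholders)
df_features = df.withColumn(
    "features",
    array_to_vector(F.array(F.col("c1").cast("double"), F.col("c2").cast("double")))
)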

KandyKad
New Contributor II

I have faced this issue multiple times.

Solution:

1. Don't use a Shared cluster, or a cluster without Unity Catalog enabled, for running 'rdd' queries on Databricks.

2. Instead, create a Personal cluster (Single User) with a basic configuration and with Unity Catalog enabled.

3. Also, for the new compute cluster, set the following parameters in Advanced Options:

  1. Under Spark Config:
    • spark.databricks.driver.disableScalaOutput true
    • spark.databricks.delta.preview.enabled true
  2. Under Environment Variables:
    • PYSPARK_PYTHON=/databricks/python3/bin/python3

Re-run your rdd queries on the new compute cluster. It works perfectly well for me.

KandyKad

I am faced with the same issue, and as I work for a company, it is not possible for me to create a new cluster. Do you have any other solution for this issue?
