Cannot use RDD and cannot set "spark.databricks.pyspark.enablePy4JSecurity false" for cluster

Christine — Fri, 03 Mar 2023 13:32:09 GMT

I have been using "rdd.flatMap(lambda x: x)" for a while to create lists from columns. However, after I changed the cluster to Shared access mode (to use Unity Catalog), I get the following error:

py4j.security.Py4JSecurityException: Method public org.apache.spark.rdd.RDD org.apache.spark.api.java.JavaRDD.rdd() is not whitelisted on class class org.apache.spark.api.java.JavaRDD

I have tried to solve the error by adding:

"spark.databricks.pyspark.enablePy4JSecurity false"

However, I then get the following error:

"spark.databricks.pyspark.enablePy4JSecurity is not allowed when choosing an access mode"

Does anybody know how to use RDDs on a cluster set up for Unity Catalog?

Thank you!

Anonymous — Thu, 09 Mar 2023 03:51:24 GMT

@Christine Pedersen : Would you like to start migrating to dataframes? The DataFrame API is a more modern and optimized way to work with structured data in Spark.

The error you are encountering is related to Py4J security settings on the cluster. In Shared access mode, Py4J security is enabled by default, so Python code can only call JVM methods that are on the allowlist. As the exception shows, JavaRDD.rdd() is not on that list, which is why accessing .rdd on a DataFrame fails.

Christine — Thu, 09 Mar 2023 07:16:29 GMT

Hi @Suteja Kanuri,

In this case I am using a PySpark DataFrame, but I am trying to get all values from a column in that DataFrame and create a list. I am using this list to filter rows in another DataFrame (see the example below):

value_list = pysparkDF.select(<column_name>).distinct().rdd.flatMap(lambda x: x).collect()

filtered_table = DF2.filter(DF2.<column_name>.isin(value_list))

But I will try to search for ways to avoid lists and keep it in dataframe format.

Anonymous — Mon, 13 Mar 2023 06:59:45 GMT

@Christine Pedersen :

You can achieve this without collecting data into a list using Spark's built-in DataFrame operations.

You can use the join operation to filter DF2 based on the distinct values in the column from pysparkDF. Here's an example:

filtered_table = DF2.join(
    pysparkDF.select(<column_name>).distinct(),
    on=<column_name>,
    how='inner'
)

This code performs an inner join of DF2 with the distinct values of the column from pysparkDF, which effectively filters DF2 to the rows whose value appears in pysparkDF. Joining on the column name rather than on an equality expression keeps a single copy of the column in the result. Note that this approach returns a new DataFrame rather than a list, which should be more efficient for larger datasets.
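A left semi join is a possible variation on the join described above: it returns only the columns of DF2 and never duplicates its rows, so distinct() is not needed. The sketch below is a minimal illustration, not the poster's code. The function name keep_matching_rows and its parameter names are made up for this example, and it assumes both DataFrames contain a column with the same name.

```python
def keep_matching_rows(target_df, lookup_df, column_name):
    """Keep the rows of target_df whose column_name value occurs in lookup_df.

    Both arguments are expected to be PySpark DataFrames. No .rdd access and
    no collect() is involved, so the data stays on the cluster.
    """
    # "left_semi" returns each left-side row at most once and drops all
    # right-side columns, so duplicates in lookup_df do not matter.
    return target_df.join(
        lookup_df.select(column_name),
        on=column_name,
        how="left_semi",
    )


# Usage with the DataFrames from this thread (not executed here):
# filtered_table = keep_matching_rows(DF2, pysparkDF, "<column_name>")
```

Compared with isin(value_list), nothing is collected to the driver, which matters when the lookup column has many distinct values.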

Shivanshu_ — Tue, 06 Jun 2023 05:14:36 GMT

@Suteja Kanuri

What if I have to run rdd.map on a column containing JSON data and then read it as a JSON string in PySpark? How can I do that?

Below is sample syntax for what I'm trying to achieve on a shared cluster, which fails with the same error related to "spark.databricks.pyspark.enablePy4JSecurity".

Syntax: spark.read.json(df.rdd.map(lambda x: x[0]))

What would be the optimal alternative?

Barmat — Fri, 04 Aug 2023 05:38:10 GMT

I have the exact same issue as @Shivanshu_. Any help would be highly appreciated.

Haiyangl104 — Mon, 07 Aug 2023 14:19:03 GMT

Try this:

# Change column_name and table_name to the actual column and table names:
placeholder_list = spark.sql("select column_name from table_name").collect()
desired_list = [row.column_name for row in placeholder_list]
print(desired_list)
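The same idea also works with the DataFrame API instead of SQL and without touching .rdd, because collect() on a DataFrame already returns Row objects. A minimal sketch, where column_to_list is a hypothetical helper name and not a Spark function:

```python
def column_to_list(df, column_name, distinct=True):
    """Return the values of one DataFrame column as a plain Python list.

    df is expected to be a PySpark DataFrame. Everything is pulled to the
    driver, so this only suits columns with a modest number of values.
    """
    selected = df.select(column_name)
    if distinct:
        selected = selected.distinct()
    # Each collected Row has exactly one field, so index 0 is the value.
    return [row[0] for row in selected.collect()]


# Usage with the DataFrames from the original question (not executed here):
# value_list = column_to_list(pysparkDF, "<column_name>")
# filtered_table = DF2.filter(DF2["<column_name>"].isin(value_list))
```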

Sumit_Kumar — Tue, 22 Aug 2023 16:37:24 GMT

Try setting the configuration below in a Databricks notebook, then retry. It should work.

spark.conf.set("spark.jvm.class.allowlist", "spark.databricks.pyspark.enablePy4JSecurity")

Nithya_r — Mon, 13 Nov 2023 09:35:09 GMT

I get the same error when using the repartition command on a shared cluster; it works fine on a single-user cluster. Is there an alternative for that? Are there any issues with continuing to use a single-user cluster?

283513 — Wed, 03 Jan 2024 02:36:09 GMT

This configuration does not work for me. Please suggest another solution. I need to use rdd.mapPartitions on a DataFrame created from Unity Catalog data:

df_unity_catalog.rdd.mapPartitions(an_function)

sue01 — Sun, 18 Feb 2024 01:05:01 GMT

Hey @283513, were you able to solve this? I am facing the same issue when using VectorAssembler on a Unity Catalog cluster.

KandyKad — Mon, 26 Feb 2024 13:09:44 GMT

Faced this issue multiple times.

Solution:

1. Don't use Shared Cluster or cluster without Unity Catalog enabled for running 'rdd' queries on Databricks.

2. Instead create a Personal Cluster (Single User) with basic configuration and with Unity Catalog enabled.

3. Also for the new compute cluster in Advanced Options set the following parameters:

Under Spark Config:
- spark.databricks.driver.disableScalaOutput true
- spark.databricks.delta.preview.enabled true
Under Environment Variables:
- PYSPARK_PYTHON=/databricks/python3/bin/python3

Re-run your rdd queries with new compute cluster. It works perfectly well for me.

mervekilincer — Thu, 21 Mar 2024 23:32:48 GMT

I am facing the same issue, and since I work for a company, it is not possible for me to create a new cluster. Do you have any other solution for this issue?

rahuja — Mon, 06 May 2024 09:21:29 GMT

Was this resolved?

dkushari — Sat, 11 May 2024 19:30:41 GMT

Hi,

Can you use json.loads instead? Example below -

from pyspark.sql import Row
import json

# Sample JSON data as a list of dictionaries (similar to JSON objects)
json_data_str = response.text
json_data = [json.loads(json_data_str)]

# Convert dictionaries to Row objects
rows = [Row(**json_dict) for json_dict in json_data]

# Create DataFrame from list of Row objects
df = spark.createDataFrame(rows)

# Show the DataFrame
df.display()

him_agg — Wed, 29 May 2024 17:44:16 GMT

I was having a similar issue in using .rdd.map()
Solved it by adding two key value pairs in the spark config for the cluster

spark.databricks.pyspark.enablePy4JSecurity false

spark.databricks.pyspark.trustedFilesystems org.apache.spark.api.java.JavaRDD

After this I was able to read the schema of the json from the column that was read as string

json_schema = spark.read.json(df.rdd.map(lambda row: row.preferences)).schema

print(json_schema)
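On clusters where enablePy4JSecurity cannot be turned off (the situation in the original question), one possible way to avoid .rdd.map() is to infer the schema from a single sample value with schema_of_json and then parse the column with from_json. This is only a sketch: parse_json_column is a hypothetical helper, and unlike spark.read.json it infers the schema from one row, so fields that are missing from that row will not appear in the result.

```python
def parse_json_column(df, column_name, parsed_name="parsed"):
    """Add a struct column parsed from a JSON string column, without df.rdd.

    Assumption: the first non-null value is representative of the column.
    """
    # Fetch one non-null sample on the driver; only a single row is collected.
    sample = df.where(f"`{column_name}` IS NOT NULL").select(column_name).first()
    if sample is None:
        raise ValueError(f"column {column_name!r} has no non-null values")

    # Imported here so the helper can be defined before Spark is needed.
    from pyspark.sql import functions as F

    # schema_of_json derives a schema from a literal JSON string.
    schema = F.schema_of_json(F.lit(sample[0]))
    return df.withColumn(parsed_name, F.from_json(F.col(column_name), schema))


# Usage with the DataFrame from the post above (not executed here):
# parsed_df = parse_json_column(df, "preferences")
# parsed_df.select("parsed.*").printSchema()
```

If the JSON documents vary a lot between rows, passing an explicit schema to from_json is likely to be more reliable than inferring it from a sample.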

Shivanshu_ — Wed, 12 Jun 2024 15:20:48 GMT

Did you try this on a UC-enabled cluster?

rahuja — Thu, 13 Jun 2024 11:44:09 GMT

In my case the problem was that we were trying to use SparkXGBoostRegressor and in the docs it says that it does not work on clusters with autoscaling enabled. So we just disabled autoscaling for the interactive cluster where we were testing the model and it worked like a charm 🙂

Hope it helps

Makal — Thu, 19 Sep 2024 12:01:42 GMT

Thanks, that solved the issue for me!

de-qrosh — Sun, 03 Nov 2024 14:18:04 GMT

Hello,
In the past I used

rdd.mapPartitions(lambda ...)

to call functions that access third-party APIs, such as Azure AI text translation, so that the API is called in batches and the batched results are returned.

How would one do this now?
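One option that may fit here is DataFrame.mapInPandas, which, like mapPartitions, hands the function an iterator of batches for each partition. Whether it is available depends on the runtime version and access mode, so treat this as a sketch to verify on your own cluster. fake_translate is a hypothetical stand-in for the real API client, and the column name "text" and the chunk size of 25 are assumptions.

```python
from typing import Callable, List


def call_in_chunks(values: List[str], chunk_size: int,
                   call_api: Callable[[List[str]], List[str]]) -> List[str]:
    """Send values to call_api in chunks and concatenate the results in order."""
    results: List[str] = []
    for start in range(0, len(values), chunk_size):
        results.extend(call_api(values[start:start + chunk_size]))
    return results


def fake_translate(texts: List[str]) -> List[str]:
    """Hypothetical stand-in for a third-party batch API call."""
    return [text.upper() for text in texts]


def translate_batches(batches):
    """Function for DataFrame.mapInPandas: takes and yields pandas DataFrames."""
    for pdf in batches:
        translated = call_in_chunks(pdf["text"].tolist(), 25, fake_translate)
        yield pdf.assign(translated=translated)


# Usage on a DataFrame `df` with a string column "text" (not executed here).
# The schema string must list every column that translate_batches yields:
# result = df.select("text").mapInPandas(
#     translate_batches, schema="text string, translated string")
```

Keeping the chunking in a plain Python function makes it possible to test the batching logic locally without a cluster.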