Databricks Community

GeKo · ‎05-10-2024

Hello,

after switching to "shared cluster" usage a python job is failing with error message:

Py4JJavaError: An error occurred while calling o877.load.
: org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file.

This error happens on the attempt of reading messages from a Kafka topic, according to the stacktrace (in the spark method spark_.read) =>

    288 else:
    289     raw_df = (
    290         self.spark_.read.format("kafka")
    291         .option(
    292             "kafka.bootstrap.servers",
    293             self.kafka_secrets.kafka_bootstrap_servers,
    294         )
    295         .option("subscribe", topic.topic)
    296         .option("groupIdPrefix", topic.consumer_group_prefix)
    297         .option("startingOffsets", "earliest")
    298         .option("failOnDataLoss", "false")
    299         .option("includeHeaders", "true")
    300         .options(**self.sasl_ssl_auth_options)
    301         .options(**spark_opts)
--> 302         .load()
    303     ).drop("timestampType")

The job runs fine if "streaming" is enabled, means we use spark_.readStream instead.

What exactly is raising the "INSUFFICIENT_PERMISSIONS" error, at using "spark_.read" methon , and how to get rid of it ?!?!

Usually this error is thrown if someone wants to access data on DBFS or has tableACLs enabled, but both of them is not the case here.

Context:

using shared cluster
everything is managed via UnityCatalog
no Hive metastore is in use, table ACLs are disabled
the job does not interact with any data from DBFS (it simply wants to read from Kafka), also potential checkpoints of Kafka are configured to use UC Volume
I know that the statement "grant select on any file..." would solve the problem, but I don't want to use it, since I explicitly do not want to allow something on DBFS which I do not want to use anyways, neither Hive metastore related stuff

Since the difference in behaviour is between using spark_.read vs spark_.readStream my guess is, that the spark_.read is internally trying to access/interact with Hive-Metastore

Any hint how to eliminate this issue is highly appreciated 😄

Hkesharwani · ‎05-15-2024

Hi, The reason for this issue could be shared cluster, Unity catalog best supports with personal cluster or job clusters.
I would suggest try using personal cluster.
Check out the below article this might help
https://community.databricks.com/t5/data-engineering/create-table-using-a-location/td-p/68725

Harshit Kesharwani
Data engineer at Rsystema

GeKo · ‎05-15-2024

Hello @Hkesharwani ,

thanks for replying.

Indeed, as I stated in the beginning of my post, the issue occurs only with shared cluster usage (single user cluster all is fine). Since I *have to* switch to shared cluster (rowlevel security is only available there atm.), it would be great if someone provides any insights of what is causing this issue on shared clusters.

sravs_227 · ‎06-07-2024

hey @GeKo

did you get any solution ?

GeKo · ‎07-08-2024

Hi @sravs_227 ,
the issue was, that the checkpoint directory (while reading from kafka) was set to a dbfs folder. We switched this now to also UC volume