
Insufficient privileges: User does not have permission SELECT on any file

GeKo
New Contributor III

Hello,

After switching to a shared cluster, a Python job fails with the following error message:

Py4JJavaError: An error occurred while calling o877.load.
: org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file.

 

 

According to the stack trace, the error occurs when the job tries to read messages from a Kafka topic via spark_.read:

    288 else:
    289     raw_df = (
    290         self.spark_.read.format("kafka")
    291         .option(
    292             "kafka.bootstrap.servers",
    293             self.kafka_secrets.kafka_bootstrap_servers,
    294         )
    295         .option("subscribe", topic.topic)
    296         .option("groupIdPrefix", topic.consumer_group_prefix)
    297         .option("startingOffsets", "earliest")
    298         .option("failOnDataLoss", "false")
    299         .option("includeHeaders", "true")
    300         .options(**self.sasl_ssl_auth_options)
    301         .options(**spark_opts)
--> 302         .load()
    303     ).drop("timestampType")

 

 

 

The job runs fine when streaming is enabled, i.e. when we use spark_.readStream instead.
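
To make the difference concrete, here is a minimal sketch of the two reader entry points. `build_kafka_reader` and its parameters are hypothetical names that are not part of the actual job; the options simply mirror the ones visible in the stack trace above. The helper only configures the reader and does not call .load().

```python
from typing import Any, Mapping, Optional


def build_kafka_reader(
    spark: Any,
    bootstrap_servers: str,
    topic: str,
    streaming: bool,
    extra_options: Optional[Mapping[str, str]] = None,
) -> Any:
    """Configure a Kafka reader without loading it (hypothetical helper)."""
    # The only difference between the two code paths is the entry point:
    # spark.read (batch) is the failing path, spark.readStream the working one.
    reader = spark.readStream if streaming else spark.read
    reader = (
        reader.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", "false")
        .option("includeHeaders", "true")
    )
    if extra_options:
        # For example the SASL/SSL settings the job passes as sasl_ssl_auth_options.
        reader = reader.options(**dict(extra_options))
    return reader
```

With this sketch, `build_kafka_reader(spark, servers, topic, streaming=False).load()` would correspond to the failing batch path, and the same call with `streaming=True` to the path that works.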

What exactly raises the INSUFFICIENT_PERMISSIONS error when using the spark_.read method, and how can I get rid of it?

Usually this error is thrown when someone accesses data on DBFS or has table ACLs enabled, but neither is the case here.

Context:

  • using a shared cluster
  • everything is managed via Unity Catalog
  • no Hive metastore is in use, and table ACLs are disabled
  • the job does not interact with any data on DBFS (it simply wants to read from Kafka); any Kafka checkpoints are configured to use a UC Volume
  • I know that the statement "grant select on any file..." would solve the problem, but I don't want to use it, because I explicitly do not want to grant access to DBFS or to anything Hive metastore related, neither of which I use anyway

Since the only difference in behaviour is between spark_.read and spark_.readStream, my guess is that spark_.read internally tries to access or interact with the Hive metastore.

Any hint on how to eliminate this issue is highly appreciated 😄

1 REPLY

GeKo
New Contributor III

Hello @Hkesharwani ,

thanks for replying.

Indeed, as I stated at the beginning of my post, the issue occurs only on a shared cluster (on a single-user cluster everything works fine). Since I *have to* switch to a shared cluster (row-level security is only available there at the moment), it would be great if someone could provide some insight into what causes this issue on shared clusters.