Hello,
After switching to a "shared cluster", a Python job fails with this error message:
Py4JJavaError: An error occurred while calling o877.load.
: org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file.
According to the stack trace, the error occurs when the job tries to read messages from a Kafka topic, in the call chain starting at spark_.read:
    else:
        raw_df = (
            self.spark_.read.format("kafka")
            .option(
                "kafka.bootstrap.servers",
                self.kafka_secrets.kafka_bootstrap_servers,
            )
            .option("subscribe", topic.topic)
            .option("groupIdPrefix", topic.consumer_group_prefix)
            .option("startingOffsets", "earliest")
            .option("failOnDataLoss", "false")
            .option("includeHeaders", "true")
            .options(**self.sasl_ssl_auth_options)
            .options(**spark_opts)
            .load()  # <-- the error is raised here
        ).drop("timestampType")
The job runs fine if "streaming" is enabled, i.e. if we use spark_.readStream instead.
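For reference, the switch between the two modes looks roughly like the sketch below. This is a simplified illustration, not the actual job code: `build_kafka_reader` and its parameters are hypothetical names, and the SASL/SSL options are assumed to be passed in as a plain dict.

```python
def build_kafka_reader(spark, bootstrap_servers, topic, streaming, extra_options=None):
    """Return a configured Kafka reader (hypothetical helper, not the real job code)."""
    # spark.readStream yields a DataStreamReader, spark.read a DataFrameReader;
    # both accept the same format/option/options calls.
    reader = spark.readStream if streaming else spark.read
    return (
        reader.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", "false")
        .option("includeHeaders", "true")
        .options(**(extra_options or {}))
    )
```

With `streaming=False`, the subsequent `.load()` is where the INSUFFICIENT_PERMISSIONS error shows up on the shared cluster; with `streaming=True`, `.load()` succeeds with otherwise identical options.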
What exactly raises the INSUFFICIENT_PERMISSIONS error when using spark_.read, and how can I get rid of it?
Usually this error is thrown when someone accesses data on DBFS or has table ACLs enabled, but neither is the case here.
Context:
- using a shared cluster
- everything is managed via Unity Catalog
- no Hive metastore is in use, table ACLs are disabled
- the job does not interact with any data on DBFS (it simply reads from Kafka); any Kafka checkpoints are configured to use a UC Volume
- I know that the statement "grant select on any file..." would solve the problem, but I don't want to use it: I explicitly do not want to grant access to DBFS or anything related to the Hive metastore, since I use neither
Since the only difference in behaviour is between spark_.read and spark_.readStream, my guess is that spark_.read internally tries to access or interact with the Hive metastore.
Any hint on how to eliminate this issue is highly appreciated 😄